1.4. A simple model of speech processing

Levelt (1995) extensively discussed the question of whether pre-articulatory monitoring exists: Does a person, during normal spontaneous speech, check his or her words and sentences before articulation? Can the speaker detect speech errors or inappropriate wording, and suppress or correct them before speaking aloud? The issue is theoretically relevant because evidence of pre-articulatory monitoring would also be evidence of the hypothesized ‘phonetic plans’, which would be inconsistent with Engelkamp and Rummer’s (1999) concept of speaking programs controlling articulation (see Section 1.2). The concept of speaking programs, however, is fundamental to the present theory of stuttering.

Levelt affirmed the existence of pre-articulatory monitoring and justified his position by referring to several experiments in which, for example, participants apparently detected wrong words in time before articulation and suppressed them (Baars, Motley, & MacKay, 1975; Motley, Camden, & Baars, 1982). In these experiments, however, the decision of whether or not the wrong word was spoken was made before overt speech started, which is crucial in our context: Even if pre-articulatory monitoring took place in these experiments, this is not evidence that the internal feedback loop works during overt speech and normal hearing. But I doubt whether pre-articulatory monitoring took place in those experiments at all: spontaneous avoidance of wrong or inappropriate words can also be explained in other ways than by incessant internal monitoring (read more).

To justify his belief that pre-articulatory monitoring plays a role in normal speech, Levelt (1995) referred to a further experiment by Lackner and Tuller (1979), which yielded two findings: (1) error detection via the internal loop was faster than via the external one, and (2) phonological errors were detected more quickly by the speaker than by a listener. The authors concluded from these results that speakers detected their own errors more quickly because they detected them (also) via the internal feedback loop. That may be so, but I think it is not the internal perception of a ‘phonetic plan’ that allows the speaker to detect his errors more quickly than a listener. It may rather be the tactile and kinesthetic feedback of articulation, which is information inaccessible to a listener. Self-perception of articulation, however, is not pre-articulatory monitoring (read more).

Undoubtedly, pre-articulatory monitoring is possible: We can try out a formulation internally before speaking aloud. But this is not spontaneous speech. Normally, we behave this way only in situations in which it is absolutely necessary to avoid saying the wrong thing, for instance, in examinations and in political or commercial negotiations. Some stutterers, however, behave this way in an attempt to avoid words with certain initial sounds on which they fear stuttering. By contrast, normal spontaneous speech is characterized by formulations not being checked internally before they are spoken, with the result that errors and corrections are common in normal speech.

The question of whether or not internal pre-articulatory monitoring of speech exists may not appear very relevant to the question of how stuttering is caused. But remember that one of the most famous and influential theories of the cause of stuttering, the Covert Repair Hypothesis by Postma and Kolk (1993), depends on the premise that formulation errors are detected by unconscious pre-articulatory monitoring. According to this theory, stuttering is caused by covert repairs of those unconsciously detected errors.

In the following diagram, Figures 2 and 3 have been combined into a simple model of speech production. This model is a synthesis of Levelt’s model with its two feedback loops, but without the separation of formulation and articulation (see footnote in Section 1.1), and Engelkamp and Rummer’s (1999) model of word forms and speaking programs (see Section 1.2). Note that not all functions depicted in Figure 4 can be active at the same time; this applies to external / internal auditory feedback as well as to self-formulated speech / phonological repetition. Concepts, in this model, are not a third kind of representation of words in addition to acoustic word forms and speaking programs. Instead, concepts (the contents of words) are links between word forms / speaking programs and non-linguistic representations (see footnote in Section 1.2).

[Diagram: Model of speech production and perception]

Figure 4: Model of speech processing. Note that not all functions are active at the same time. Orange arrows: functions during normal, spontaneous speech.

An example of purely phonological repetition of speech without semantic comprehension is the repetition of nonwords (also referred to as pseudowords). Normally, a speaker has no speaking program for a nonword, hence the articulatory sequence must be composed of speaking programs for familiar syllables or phonemes. The more familiar those syllables and sounds are to the speaker, the easier the repetition of the nonword. The example of nonword repetition shows that a kind of phonological encoding is indeed part of my model of speech production: the derivation of a motor sequence from an acoustic sequence. But, unlike Levelt, (i) I do not assume that it is a computational process; I rather assume that it works in an analog manner, and I therefore doubt whether the term ‘encoding’ is appropriate, and (ii) it is merely a special case that is usually not needed in everyday talking, but is essential for learning new words.

Talking is a routine behavior. This is also true for conjugating and declining, which are, at least in one’s native language, motor routines rather than a matter of thinking or even computation in the brain. Therefore, I do not believe that lemmas are converted into lexemes, e.g., “father” into “father’s”. I think adding /s/ is simply a speaking program, a learned motor reaction to a certain speech situation, namely the situation in which I want to express that something belongs to the father. The difference between Levelt’s model of speech production and the model that I propose here mainly results from the fact that Levelt’s model is strongly influenced by the artificial intelligence research of the last decades of the 20th century, that is, by the idea that the brain is the biological version of a computer, and that we could figure out how humans produce speech by thinking about how to design a speaking machine. That was a fundamental error.




Pre-articulatory monitoring

In the experiments by Baars, Motley, and MacKay (1975) and Motley, Camden, and Baars (1982), subjects were tempted to make certain speech errors in the following way: Immediately before they spoke a pair of target words, they had to silently read a priming list of word pairs that sounded similar to the speech error intended by the experimenters. The priming lists and the pairs of target words were designed so that, with high probability, nonwords in the one experiment (Baars, Motley, & MacKay, 1975) and taboo words in the other experiment (Motley, Camden, & Baars, 1982) should be produced as slips of the tongue. But neither happened. The authors, and also Levelt (1995), concluded from these results that both the utterance of nonsense and the utterance of taboo words were suppressed in time by internal pre-articulatory monitoring.

First, we can say: If pre-articulatory monitoring took place in these experiments, it did not take place during overt speech, but before. Therefore, these experiments do not provide evidence that the internal feedback loop works during overt speech. But do the reported observations provide any evidence for internal pre-articulatory monitoring at all? The lexical bias, i.e., the fact that slips of the tongue often produce wrong words but rarely nonwords, is easily explained within the framework of the model proposed by Engelkamp and Rummer (1999) (see Section 1.2): The articulation of all familiar words is represented in the brain by speaking programs, but there are no speaking programs for nonwords. Thus the articulation of a nonword requires controlling the speech movements by putting together speaking programs of syllables or phonemes in the correct order, which rarely happens in everyday talking (in such exceptional cases, a kind of phonological encoding actually takes place).

Like the lexical bias, the suppression of taboo words is not necessarily caused by internal pre-articulatory monitoring of speech. As with other indecent behaviors, speaking taboo words may be spontaneously suppressed by the same ‘mechanism of fear’ that saves one from doing dangerous or awkward things, without requiring incessant self-monitoring. The ‘somatic markers’ posited by Damasio (1994) may represent such a mechanism. Damasio assumed that mental representations of concepts are linked to somatic markers so that, when a concept is activated in the brain, a feeling (i.e., the mental representation of a somatic state) is co-activated, providing an emotional evaluation of the concept that helps the person to control behavior. In the case of a taboo word, the somatic marker, the feeling co-activated with the word, may be fear (of disgrace or punishment).

Therefore, no internal monitor is necessary for suppressing taboo words. It may rather be the activation of the taboo word itself (more precisely, the activation of its speaking program) that co-activates a feeling of fear, which prevents the program from starting. However, apart from the question of which mechanism in the brain may control these things: The appropriate use of a word requires knowledge of its meaning, including the knowledge of whether or not the word is taboo. (return)

Error detection by the speaker and by a listener

In the experiments reported by Lackner and Tuller (1979), subjects repeated syllable sequences like /pi-di-ti-gi/, /pi-œ-ti-o/, or /œ-i-o-u/ again and again for 30 seconds, one syllable per second, paced by a blinking light. When the speaker or the listener detected an error, he or she pressed a button. Under these conditions, it is not surprising that speakers detected their own errors more quickly than listeners: First, the speaker has not only the acoustic information but also the tactile and kinesthetic information of his own speech movements; for example: If I start to say /ti/ instead of /pi/, I can feel the wrong movement before a listener can hear anything. This is true for all syllables with a consonant onset.

Second, the listener depends completely on the acoustic information, and when listening to monotonously repeated syllable sequences like /pi-di-ti-gi/ etc., a listener may doubt for a short moment whether he has actually heard an error or not, and thus may be reluctant to press the button. I think these two reasons suffice to explain the longer error detection latencies of listeners compared to speakers. Consider that Lackner and Tuller (1979) actually did not measure pure detection latencies, but reaction times, because subjects had to press a button.

It had already become clear in the first part of the experiment, in which the latencies of error detection via the external and the internal feedback loop were compared, that it makes a difference whether a vowel or a consonant is wrong, which indicates that the self-perception of articulation plays a role in error detection. Self-perception of articulation, however, is not pre-articulatory monitoring. (return)
