Self-monitoring of speech serves to detect errors (slips of the tongue), but also to check whether a produced utterance matches the intended message. As already mentioned in the preceding sections, self-monitoring further plays an important role in speech acquisition and in the control of speech as a sensorimotor sequence: It ensures that the next speaking program can be executed only after the current one is correct and complete. A further part of the self-monitoring of speech serves the online adjustment of speed, volume, pitch, and articulatory distinctness.
Speech errors are detected by comparing a speech unit (e.g., a word or phrase) just spoken with an expectation of its correct form, i.e., its correct sound sequence. An important question for our topic is: How can those expectations be generated? Levelt (1995) proposed that a ‘phonetic plan’ produced in the Formulator (see figure) is the basis of those expectations. A variant of this approach is the ‘efference copy’ theory. An efference copy is the (hypothesized) projection of a motor plan onto the sensory system in which the movement and its success or failure are perceived (on efference copies assumed in speech monitoring in the context of stuttering see, e.g., Beal et al., 2010; Brown et al., 2005).
However, the question can be posed whether efference copies are needed at all for the self-monitoring of speech. When listening to the speech of someone else, humans are quite able to detect errors immediately without having any copy of the speaker’s plan. Syntax errors (phrase structure violations) in spoken sentences elicit responses in a listener’s brain measurable as event-related potentials (ERPs) after ca. 120 ms; semantic errors elicit ERPs after ca. 400 ms (Friederici, 1999). Obviously, a listener is able to generate an expectation of what a speaker is going to say and what it should sound like, and to compare this expectation with the perception in a very short time.
How can a listener generate these expectations? First, he or she intuitively knows from experience how the initial words of a sentence constrain a speaker’s options for continuing: the more words of a sentence have already been spoken, the fewer syntactic and semantic options remain. That makes it easier for the listener to quickly generate expectations and to identify a perception not matching the expectation as a potential mistake. Put simply: It is the words already heard and the listener’s implicit knowledge of language that enable the listener’s brain to generate expectations of how the speaker must continue.
However, it is not only the words already heard that enable us to predict the following ones. A listener often recognizes a familiar word after hearing only its initial portion, particularly if the word is embedded in a sentence context. The context facilitates the recognition of the word on the basis of a few initial sounds and the prediction of its phoneme sequence. These assumptions are in line with Astheimer and Sanders (2009, 2012), who found by means of auditory event-related potentials that both adults and preschool-aged children, when listening to connected speech, temporally modulate selective attention to preferentially process the initial portions of words. Halle and Stevens (1959) already developed a model describing how phoneme sequences can be predicted on the basis of minimal input and how, in this way, words can be recognized. This ‘analysis-by-synthesis’ model was updated by Poeppel and Monahan (2011) (read more).
In summary, we can say that listening and the implicit knowledge of language together enable a person to detect errors in the speech of someone else. Let us now assume, following Levelt (1995), that the same mechanisms that let us detect errors in the speech of others also operate in monitoring one’s own speech. Then both components necessary for the self-monitoring of speech – the expectation of the correct form of a speech unit and the perception of the speech unit produced – depend on auditory feedback (read more). In monitoring one’s own speech, however, it may be somewhat easier to generate expectations of the correct forms than in monitoring the speech of someone else, since the speaker knows the intended message of his or her own speech. That may be the reason why errors in one’s own speech, particularly semantic errors, are sometimes detected more quickly than errors in the speech of others.
We now have a rough scheme of normal speech production. Its main features are:
Speaking means producing a sensorimotor sequence. A speech sequence is composed of speech units like phonemes, words, phrases, clauses, and breathing pauses.
Speaking is mainly feedforward-controlled by speaking programs, that is, by motor routines controlling the production of familiar speech units. Phoneme sequence, syllable structure, and linguistic stress are integrated in a speaking program.
Speaking is accompanied by automatic, feedback-based self-monitoring that interrupts the flow of speech when an error has been detected, in order to enable a correction.
Error detection works by comparing the speech unit just produced, as perceived via the external auditory feedback, with an expectation of the correct sound sequence of that speech unit.
The expectation of the correct sound sequence of a speech unit is generated on the basis of the auditory feedback of the initial portion of the unit, supported by the speaker’s knowledge of the intended message.
These basic assumptions about normal speech production provide the framework for the theory of stuttering presented in the next chapter.
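The last two features of the scheme above can be illustrated with a deliberately simple sketch. It is a toy illustration only, not a model taken from the literature cited here: the small lexicon, the use of letters as stand-ins for speech sounds, the two-sound onset, and the function names are all assumptions made for the example. An expectation is generated from the initial portion of the perceived unit, supported by the intended message where one is given, and an error is reported when the perceived unit does not match that expectation.

```python
from typing import List, Optional

# Hypothetical lexicon of familiar speech units; letters stand in for sounds.
FAMILIAR_UNITS: List[str] = ["monitor", "money", "month"]


def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading sounds two units share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def expected_form(perceived: str, intended: Optional[str] = None,
                  onset_len: int = 2) -> Optional[str]:
    """Generate an expectation from the initial portion of the perceived unit.

    Assumption: the first `onset_len` sounds select the candidate units.
    The intended message, if known, settles the choice; otherwise the
    candidate that agrees longest with what was heard is expected.
    """
    onset = perceived[:onset_len]
    candidates = [u for u in FAMILIAR_UNITS if u.startswith(onset)]
    if not candidates:
        return None
    if intended in candidates:
        return intended
    return max(candidates, key=lambda u: common_prefix_len(u, perceived))


def detect_error(perceived: str, intended: Optional[str] = None) -> bool:
    """Report an error when the perceived unit deviates from the expectation."""
    expectation = expected_form(perceived, intended)
    if expectation is None:
        return False  # no expectation could be generated, so nothing to compare
    return perceived != expectation


print(detect_error("monitor"))                  # correct form, no error
print(detect_error("monitol"))                  # deviating sound sequence
print(detect_error("month", intended="money"))  # mismatch with the intended message
```

In this sketch, knowing the intended message lets the comparison catch a well-formed but unintended word, which corresponds to the suggestion above that semantic errors may be easier to detect in one’s own speech.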
The model describes how the brain can ‘guess’ and predict a word on the basis of only a few sounds perceived. Analysis by synthesis allows us to understand speech immediately and also to immediately detect phonological errors in the speech of someone else. It is assumed that a first vague prediction is updated step by step on the basis of additional sounds perceived in the meantime, and/or on the basis of the context (Poeppel and Monahan, 2010). Incidentally, the model also explains our ability to understand imperfectly produced speech, for instance, speech with a strong foreign accent. The ability to recognize a familiar word or phrase and to generate an expectation of its correct sound sequence on the basis of a few initial phonemes seems to be a special case of a general ability that enables us, for example, to identify a familiar musical composition on the basis of only a few initial bars.
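The stepwise updating described here can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the analysis-by-synthesis model itself: the toy lexicon, the use of letters as stand-ins for phonemes, the numeric weights standing in for context-based expectations, and the function names are all hypothetical. After each additional sound, the candidates that no longer fit are dropped and the weights of the remaining ones are renormalized.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon. Letters stand in for phonemes; the weights stand in
# for how strongly the sentence context makes each word expected.
LEXICON: Dict[str, float] = {
    "cat": 0.2,
    "cap": 0.1,
    "can": 0.3,
    "cup": 0.1,
    "dog": 0.3,
}


def rank_candidates(heard: str, lexicon: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return the words compatible with the sounds heard so far.

    Weights are renormalized over the remaining candidates and the list is
    sorted from most to least expected.
    """
    matches = {w: p for w, p in lexicon.items() if w.startswith(heard)}
    total = sum(matches.values())
    if total == 0:
        return []
    return sorted(((w, p / total) for w, p in matches.items()),
                  key=lambda item: (-item[1], item[0]))


def predict_incrementally(sounds: str,
                          lexicon: Dict[str, float]) -> List[List[Tuple[str, float]]]:
    """Update the prediction after each additional sound perceived."""
    return [rank_candidates(sounds[:i], lexicon) for i in range(1, len(sounds) + 1)]


for step, ranking in enumerate(predict_incrementally("cat", LEXICON), start=1):
    print(step, [(word, round(weight, 2)) for word, weight in ranking])
```

With each sound the candidate set shrinks, so a context-favored word can lead the ranking before the word has been heard completely, and it is replaced as soon as a later sound contradicts it.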
It may appear somewhat strange when I claim that auditory feedback is the basis not only of the perception of one’s own speech, but also of the prediction of the correct forms (sound sequences) of spoken words or phrases. The thesis has to do with a basic problem of linguistics, namely the relation between language and thought (see, e.g., Carruthers & Boucher, 1998). The specific question in our context is: Do we, in spontaneous speech, know the formulation of our sentences before we have spoken them? Quite a few people will claim they do. But would they also claim to know their thoughts, i.e., their internally formulated sentences, before they have perceived them, i.e., heard them internally? Hardly. But if it is true that we become aware of our internally spoken sentences by hearing them internally – why should we believe that we become aware of our externally spoken sentences in any other way than by hearing them externally?
It makes little sense to say we would think before thinking in order to formulate our thoughts, and it makes just as little sense to say we would, in spontaneous speech, think before speaking in order to formulate our sentences. Spontaneous speech is itself thinking – only aloud. Clearly, unconscious brain processes precede the formulation of a sentence, in overt speech as well as in inner speech. But unconscious brain processes are not thoughts – thoughts are conscious. Therefore, in spontaneous speech, a speaker knows neither the formulation of a sentence nor the sound sequence of a word (in its specific inflected form) before he or she has spoken and heard the sentence. All that the speaker knows beforehand is the intended message – as far as the term knowledge is appropriate here; I think spontaneous speech is very often an immediate behavioral response to a situation, even without an awareness of an intended message: One word leads to another.