1.2. Speaking programs and acoustic word forms

A longer utterance is composed of phonemes, syllables, words, phrases, clauses, and sentences, but it also contains pauses for inhalation. I refer to all these units of differing length and complexity as ‘speech units’ or simply as ‘units’. Most of these units are themselves sequences of units of lower complexity. Some kinds of units, e.g., clauses and sentences, are variable, that is, novel units of this kind are produced every day. Other kinds of units, e.g., phonemes, syllables, words, and idioms, are relatively invariable: novel words and idioms are sometimes created, but novel syllables or phonemes emerge very rarely in one’s native language. Based on the considerations in the last section, we can assume that the articulation of relatively invariable speech units, once they have been automatized, is feedforward-controlled by motor programs.

We assume that there is a particular program for the production of every phoneme, every syllable, every familiar word, and for frequently used phrases and idioms. Speech motor programs are also assumed in other models of speech production, e.g., in the DIVA model (see, e.g., Guenther, 2006). Segawa, Tourville, Beal, & Guenther (2015) define speech motor programs as “stored neural representations that encode the sequence of movements required to produce the utterance”. Programs are structured hierarchically: the program controlling the production of a syllable is the sequence of the programs for its phonemes, and so on up to words and familiar phrases.
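
The hierarchical structure described above can be pictured with a small sketch. It is only an analogy for the nesting of programs, not a claim about how the brain stores or executes them; the class name `Program`, the method `execute`, and the example word are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Program:
    """Hypothetical stand-in for a speaking program: a label plus sub-programs."""
    label: str
    parts: List["Program"] = field(default_factory=list)

    def execute(self) -> List[str]:
        """Return the lowest-level units in the order they would be produced."""
        if not self.parts:  # a phoneme-level program has no sub-programs
            return [self.label]
        produced: List[str] = []
        for part in self.parts:  # a higher-level program is a sequence of lower ones
            produced.extend(part.execute())
        return produced


# Phoneme-level programs (illustrative only).
b, a, n = Program("b"), Program("a"), Program("n")

# Syllable programs are sequences of phoneme programs.
ba = Program("ba", [b, a])
na = Program("na", [n, a])

# A word program is a sequence of syllable programs.
banana = Program("banana", [ba, na, na])

print(banana.execute())
```

In this picture, starting the word-level program is a single step; the order of syllables and phonemes is already contained in it and does not have to be assembled anew each time.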

There is a further argument supporting the thesis that the production of relatively invariable speech units is controlled by motor programs. Based on the results of aphasia research, Engelkamp and Rummer (1999) have proposed a psycholinguistic model in which a word is represented in the brain in two different ways: as an acoustic word form and as a ‘speaking program’, that is, a motor program controlling the production of the word. Figure 2 shows the relationship of acoustic word forms and speaking programs to concepts.


Figure 2: Acoustic word forms (= auditory memories/images of what words sound like) and speaking programs (= motor routines for the production of words). Only relationships within the brain are depicted (compare Fig. 4).

The acoustic word form allows us to recognize (to identify) and to understand a perceived word. The speaking program, by contrast, is the motor program controlling the production of the word; it allows us to express the concept immediately, as we usually do in spontaneous speech, without first recalling the acoustic word form. Acoustic word form and speaking program are linked not only via the concept (the semantic content), but also directly: We are quite able to repeat a phoneme sequence just heard – even if the content is unknown to us, or if no content exists (in the latter case, the phoneme sequence is called a pseudoword or a nonword). It is just this ability that allows us to learn new words by imitation, which is the basis of language acquisition. The direct link in the opposite direction – from speaking programs to acoustic word forms – is crucial for verbal thinking (inner speech); I will elaborate on this later.

A speaking program is a motor routine acquired by the repetitive production of a speech unit, that is, by practice. The speaking program of a multisyllabic word contains both the phoneme sequence and the syllable structure, including linguistic stress (word accent). Therefore, phoneme sequence and syllable structure do not need to be synchronized by the brain. This synchronization is already contained in the speaking program: When you learn to speak a new word, you acquire and automatize all of them – the phoneme sequence, the syllable structure, and the linguistic stress – together and at the same time.

From the premise that the speaking program of a word contains the phoneme sequence, it further follows that the brain does not need to assemble a familiar word from its phonemes before speaking (phonological encoding). The consequence for a theory of stuttering is: Given that the production of all familiar speech units is controlled by speaking programs, we can rule out difficulties in phonological encoding, as well as in synchronizing the phoneme sequence with the syllable rhythm, as the cause of stuttering.

Engelkamp and Rummer (1999) only claimed that the production of words is controlled by speaking programs, but this might be true for all speech units we can ‘reel off’ immediately from memory without any decision on articulation or formulation. Therefore, I assume that not only words, but also frequently used phrases, idioms, and proverbs are controlled by speaking programs – and even memorized poems and the lines of an actor’s part. It is just the ability to reel off such speech sequences that indicates they are controlled by one program, and, as will be explained later, speaking programs may be the reason why many a stutterer is able to recite poems fluently from memory, or even to work as an actor or actress. Further, of course, there are speaking programs for all familiar syllables and phonemes; they can be combined into new words, or can be articulated individually if necessary, e.g., in spelling. In normal everyday talking, however, one speaking program mostly controls the production of a word or a short phrase.

The preceding considerations can be summarized as follows: To spontaneously speak familiar words and phrases is not so much a matter of planning as a matter of behavioral routines – “words are learned motor sequences.” (Van Riper, 1971, p. 393). These motor sequences, once they have been acquired and automatized by repetition, are feedforward-controlled by motor programs, here referred to as ‘speaking programs’. In addition, a concurrent, automatic, and largely unconscious monitoring process ensures that the next speaking program can be executed only if the previous one was correct and complete. This monitoring is the topic of the next page.





Patients with Broca’s aphasia understand spoken language, but have difficulty finding words when speaking. They speak disfluently, search for words, and frequently use paraphrases. By contrast, patients with Wernicke’s aphasia speak fluently, but have difficulty in speech perception, including the monitoring of their own speech. Often they are not well able to formulate coherent, intelligible sentences. These observations suggest that the production of familiar words and phrases is independent of speech comprehension and is based on motor routines.

Acoustic word forms

Engelkamp and Rummer (1999) call them ‘acoustic word nodes’, referring to a connectionist model of speech processing. I prefer the more neutral term ‘acoustic word form’, because the difference between connectionist and hierarchical models of speech processing, in my view, plays no role in answering the question of the cause of stuttering.

Concepts (contents of words)

The basis of language comprehension consists of links from words to nonverbal representations, that is, links to memorized visual, acoustic, or other sensory impressions and experiences. The word ‘dog’, for instance, might be linked to a visual pattern helping us to identify an animal just seen as a dog. Additionally, there may be links to acoustic memories of barking, and to other memories of personal experiences with dogs – all these are nonverbal representations giving the word ‘dog’ (the acoustic word form as well as the speaking program) a content.

However, what kind of nonverbal representations could words like ‘and’ or ‘definition’ be linked to? In explaining the content of such words, we can hardly refer to sensory impressions. Instead, we must paraphrase the word by means of other words in order to explain its content. The contents of most of the words in an adult’s vocabulary might consist mainly of links to other words. My point is: Concepts are not a third kind of word representation in our brain, in addition to acoustic word forms and speaking programs – concepts are links.

The following consideration makes clear that acoustic word forms and speaking programs exist independently of semantic contents: With some practice, you can learn to speak and to memorize a nonword like /matula/ so that you recognize it if you hear it some days later, and so that you can always produce it like any other word. In this way, you have acquired both the acoustic word form and the speaking program of /matula/ despite the fact that it has no content.

Phonological encoding

The idea that the brain transforms concepts into words and assembles the words from their phonemes in order to generate a ‘phonetic plan’, from which a sequence of articulatory movements is then derived, is very common and is part of Levelt’s (1995) model of speech production. However, speaking a familiar word does not require me to recall its phoneme sequence in order to articulate it. Instead, I immediately access the speaking program, i.e., the program controlling the motor sequence. I start the speaking program of the word, and I hear its acoustic form (the sound sequence) while I am speaking it. Recalling the sound sequence before speaking is necessary only if I’m not sure how to articulate the word correctly, that is, with unfamiliar words.

Note that the articulatory routine controlling the production of a familiar word contains not only a sequence of phonemes, as symbolized by the letters of the alphabet, but also the transitions between them and all word-dependent and dialect-dependent variations of the phonemes. And we have the ability to vary these routines spontaneously, e.g., we can substitute vowels playfully (for example: kangaroo, kingaroo, kongaroo, kanguree, etc.), and many people are able to switch into foreign or regional accents. Such abilities can hardly be the result of a computational process like the phonological encoding of an entry in a mental lexicon.

I think the concept of a ‘phonetic plan’ resulting from ‘phonological encoding’ is fundamentally wrong because, in this way of thinking, the human brain is confused with a computer. Speaking is a behavior controlled not by computation, but by learned motor routines. Only when we are learning to speak a new word must the motor sequence first be assembled from the speaking programs of phonemes and/or familiar syllables, before it is automatized by practice.

Linguistic categories and brain processes

We should consider that our linguistic categories – phoneme, syllable, word, clause, sentence – are not relevant for sensorimotor control in the brain: For example, the basic sensorimotor programs for speaking might rather be frequently occurring sound combinations learned in the babbling period than the pure single sounds of the alphabet – therefore, there might be no difference between phonemes and syllables on the level of speaking programs. And syllables are often used as monosyllabic words, without difference on the level of articulation, i.e., in terms of speaking programs.

Likewise, the difference between words, phrases, clauses, and sentences might hardly play a role on the level of sensorimotor control: Frequently used phrases and short sentences like “How do you do?” might be produced like words by means of only one speaking program; but learning a new and long word – for example, a German compound like Bundesverfassungsgerichtsurteil (= ruling of the Federal Constitutional Court) – might require sequencing similar to that normally required in sentence production.

By the way: Since acoustic word forms and speaking programs are in principle independent of semantic contents (see 3rd footnote), we need not assume the existence of a ‘mental syllabary’, i.e., of a particular register of syllables without semantic content. There is no essential difference in the brain between speaking programs for words, familiar phrases, or a memorized poem on the one hand, and for syllables and phonemes on the other hand. Speaking programs may differ greatly in complexity and in the degree of hierarchization, but their decisive common trait is: They all are sensorimotor routines acquired by practice.
