This
is basically for 50B and upwards:
However, if you are really interested in the information
about how the air molecules work in the vocal tract, this
is one of the best sources I have found to date
Voice Acoustics: an introduction
Speech science has a long history. Voice acoustics is an
active area of research in many labs, including our own,
which studies the science of singing, as well as the
speaking voice. This document gives an introduction and
overview. This is followed by a more detailed account
Throughout, a number of simple experiments are suggested to
the reader. As background, this link gives a brief multi-media
introduction to the operation of the human voice.)
|
Introduction and overview
A common simplification is the Source-Filter
model, which considers the voice as involving two
almost separate processes: the source produces an
initial sound and the vocal tract filter modifies
it. For example, at the larynx (sometimes called
voice box), we produce a sound whose spectrum
contains many different frequencies. Then, using
tongue, teeth, lips, velum etc, (collectively
called articulators) we modify the spectrum of
that sound over time. In this simple introduction
to the voice, we discuss the operations of first
the source of the sound, then of the filter
that modifies its spectrum, then of interactions
between these. In a second part, we then return to
look at the components in more detail.
The source
There are several sources of sound in
speaking. The energy usually comes from
air expelled from the lungs. At the
larynx, this flow passes between the vocal
folds.
In voiced speech, the vocal
folds (sometimes misleadingly called
vocal cords) vibrate. This allows puffs
of air to pass, which produces sound
waves. Here is the first of a series of experiments
(colour coded in the text): place your
fingers on your neck near your larynx (on
your Adams apple) then sing or speak
loudly. Can you feel the vibration? The
vibrations produced in voiced speech
usually contain a set of different
frequencies called harmonics. (See Sound
spectrum.)
In whispering, the folds do not
vibrate, but are held close together. This
produces a turbulent (irregular) flow of
air. This in turn makes a sound comprising
a mixture of very many frequencies, which
is called broad band sound. This
gives the windy sound that is
characteristic of whispering. Try some experiments:
can you whisper a note? What are the
differences between speaking very softly
and whispering loudly? Can you feel the
vibration in your neck when you whisper?
Can you sing in a whisper?
This allows us to divide speech sounds
into voiced, meaning that the
vocal folds vibrate, and unvoiced.
In pronouncing some sounds, such as the
f and ss in the word fuss,
turbulence is produced elsewhere: between
teeth and lips in f and between tongue
and hard palate in ss. Both of these are
unvoiced phonemes. (A phoneme is an
element of speech sound; unvoiced means
that the vocal folds dont vibrate.) Next
experiment:
try sustaining these sounds while feeling
for vocal fold vibration. The p and t
in pit are also unvoiced phonemes but
here the source of sound is related to the
sudden opening or closing of the air path.
Experiment:
at the end of 'pit', don't let your tongue
leave contact with the hard palate. What
does this tell you about the contribution
to 't' of closing and opening the path for
air flow? If you slow 'pit' down a lot and
exaggerate the 't', you can probably
notice a silent gap before the end of the
word. This may be a suprises. In the
sentence 'put it up' there are three
nearly silent gaps, one in each word.
There is no silence between 'put' and 'it'
or between 'it' and 'up'. You might want
to record the sentence and look at the
sound track.
Some voiced phonemes, such as
vowel sounds in normal speech, use
vibrations of the vocal folds with
relatively little turbulence. In others,
such as the v and z in viz or the
b and d in bid combining both the
sound from the larynx and the sound from
the constriction. Another experiment:
In whispered speech, by definition, no
phonemes are voiced, so the differences
between pit and bid disappear. Listen
for the difference in normal and whispered
speech. Try it with a friend, or record
the words out of context. Then try some
more examples using this table.
|
|
vowels |
fricatives |
plosives |
voiced |
normal vowels |
z
j v |
b
d g |
unvoiced |
whispered vowels |
ss sh
f |
p
t k |
|
The table compares some
pairs of phonemes that are pronounced with (nearly) the
same articulation either with vocal fold vibration
(voiced) or without vibration (unvoiced).
In fricatives, the tract is so constricted (by
tongue, palate, teeth, lips or a combination) that
sustained turbulent flow contributes broad-band sound to
the spectrum. Plosives involve opening and/or
closing of the tract with the lips (p, b) or the tongue
(t, d; k, g) at different places of articulation. The
sudden opening or closing and associated turbulence
briefly produce broad band sound in plosives. Experiment:
record this and look at the sound track. And also the
spectrograms, which brings us to the next topic.
The filter
The sound of the source interacts with the filter
(and also, as we'll see later, vice versa).
Depending on how you position your tongue and the shape of
your mouth opening, different frequencies will be radiated
out of the mouth more or less well. Another experiment:
sing a sustained note at constant pitch and loudness,
while varying the opening of your mouth and the position
of the tongue. This will allow you to produce most of the
vowels of English and some other phonemes, such as the
ll in all or the r in or, as pronounced in some
accents.
How you position your velum (soft palate) also makes a
difference. In the normal (high) position, all of the air
and sound goes through the mouth. Lower it and you connect
the nasal pathway to the mouth and lower vocal tract.
Lower it further and you seal the mouth off from the
pathway from nose to larynx. For the next experiment,
observe the differences between a nasal sound (ng) and a
non-nasal one (ah), then try sealing and unsealing your
nose with your fingers, and also opening and closing your
mouth, which will tell you how completely your velum seals
one of the pathways. (If you are not alone, and if you've
been doing all of these experiments, you may notice some
strange looks.)
Vowels
To a large extent, vowels in English are
determined by how much the mouth is opened, and where the
tongue constricts the passage through the mouth: front of
the mouth, back or in between. One can map the vowels in
terms of these articulatory details, or in terms of
acoustic parameters that are closely related to them. Here
are maps for two different accents of English.
The frequencies on the axes correspond to bands of
frequencies that are efficiently radiated, about which
more later. The vertical axis on these graphs roughly
corresponds to the jaw position (high or low) or the size
of the lip opening. The horizontal axis corresponds to the
position of the tongue constriction. Well return later to
explain more about such maps, and how they may be
obtained.

Vowel planes for two accents
of English (Ghonim et al., 2007, 2013). These data were
gathered in a large, automated survey in which
respondents from the US (left) and Australia (right)
identified synthesised words of the form h[vowel]d:
a form in which most examples are real words. short
and long indicate that more than 75% of the choices
fell in these categories.
Vowels and some other phonemes may be sustained over
time: for them, the position of the articulators (and so
the values of the well-radiated frequency bands) is
relatively constant. Experiment:
try holding the vowel in 'who'd' and slowly evolving it to
that in 'hard' and back again. Can you feel (or see in a
mirror) the aperture between your lips changing? Can you
do it without moving your tongue? Wihtout changing the lip
aperture? Now try 'head' to 'hoard' and back. Can you do
that without moving your tongue? Without changing the lip
aperture?
Consonants
For other phonemes (such as the p, b, t, d discussed
above), the change in articulation over time is important.
Consequently, so are the variations with time of the
associated frequency bands, as is the broad band sound
associated with the opening or closing (Smits et al.,
1996; Clark et al., 2007). In the examples p, b, t, d,
the mouth opening is obviously changing during the
consonant. Experiment:
try slowing down (a lot) the motion of opening and closing
and see if you notice what seems like a change in the
vowel.
Like vowels, liquids (r and l) and nasal consonants (n,
m, gn) can be voiced or unvoiced and like vowels they have
a characteristic set of spectral peaks. For these, the
tongue provides a narrower constriction.
In speech, vowels are in a sense less important than
consonants: you can often understand a phrase even
f ll vwl nfrmtn s
bsnt. On the other hand, vowels are more important in
singing, because the vowel is often extended in time to
produce a note, and sometimes decorated with loudness
variations, vibrato or a trill.
Source-filter interactions
The separation of parts of the voice function into
source and filter is practical, but one should
remember that the distinction is incomplete. For instance,
the geometry of the vocal folds affects not only the
operation of the folds and thus the source, but also
affects the acoustic properties of the filter. The
geometry of the vocal and nasal tracts determines how they
filter the sound, but the acoustical properties resulting
from this geometry are thought to affect the operation of
the vocal folds. We talk about these complications below.
Contrasting the voice with wind instruments
If we neglect the influence of the articulators on the
larynx, we have the Source-Filter model. Superficially, it
may seem obvious to a singer that the larynx and the
articulators are independent: to many singers,
particularly men in the low range, it seems that we can
vary the pitch (~ the source) and vowel (~ the resonator/
filter) independently.
In contrast, an analogous argument would seem a very
odd approximation to someone who plays brass
instruments. A trombonist knows that the resonances in the
bore of the instrument (~ the resonator/ filter) do indeed
affect the motion of the players lips (~ the source). In
fact, a brass players lips generally tend to oscillate at
one of the frequencies of resonances in the bore. (See Acoustics
of Brass Instruments) We shall return to this below,
but lets first note the following important quantitative
difference between the two.
A trombone has a range that overlaps that of a mans
voice. However, the trombone is longer (a few metres) than
a mans vocal tract (~ 0.2 m). Somehow, I have a trombone
range voice inside a piccolo-sized instrument! The range
of fundamental frequencies of the trombone lies within the
range of the trombone's resonances. The range of the
voice, especially of a mans voice, usually lies well
below the frequencies of the vocal tract resonances. You
are probably thinking that this difference and therefore
the approximation that the resonator doesn't affect the
source is most questionable for high pitches, when the
fundamental of the voice enters the range of vocal tract
resonances. You're right, and well come back to this.
The Source-Filter model
In the (simplest version of the) Source-Filter model
(Fant 1960), interactions between sound waves in the mouth
and the source of sound (usually the glottis) are
neglected. Although oversimplified, this model explains
many important characteristics of voice production.
The figure below is an experimental illustration of the
Source Filter model, using 3D-printed models of two
configurations of a vocal tract, corresponding to the
vowels in the words 'had' and 'heard'. (Click on diagram
for a higher
resolution version.) The source the periodic
laryngeal flow was synthesised, then measured as it was
input to the model tracts. Because the laryngeal flow is
periodic, its spectrum is made up of harmonics. The output
at the 'lips' was also measured. The gain functions (the
transpedance) of the tracts were measured independently.
They are shown below in both the time and frequency
domains.

Experimental
illustration of the simple Source-FIlter model. The
graphs in this figure were all measured experimentally.
Details
are in the paper "An
experimentally measured Source-Filter model: glottal
flow, vocal tract gain and radiated sound from a
physical model," Wolfe, J., Chu, D., Chen, J.-M.
and Smith J. (2016) Acoust. Australia, 44,
187191. It is available for reprint, with
acknowledgment.
Notice how the formants in the output sound (F1 to F4 or
F5) correspond approximately to the peaks in the gain
function due to resonances (R1 to R4 or R5). Notice too
that the output sound is periodic, but that it is
difficult to identify other features in the time domain.
As noted, the measurements above use 3D printed models of
the tracts (and indeed, books and papers on phonetics
usually just show cartoon sketches). Is it possible to use
direct measurements on real vocal tracts? The answer is
that we can't do it directly. However, we can make
indirect measurements. We cant measure the flow spectrum
through the larynx, but we can measure the vibration of
the vocal folds. Here, we do that using an
electroglottograph (EGG): we apply a small radio frequency
voltage across the neck using skin electrodes at the level
of the vocal folds. The magnitude of the current that
flows varies as the folds come into contact and separate.
The spectra and sound files at the top of the figure are
an EGG signal. Below that, we show the results of
measurements of the resonances of the vocal tract, made at
the mouth, during speech. This gives a quasi-continuous
line whose peaks identify the resonances. It also shows
the harmonics of the voice. (We discuss this technique here.)
Below that are the spectra measured for that particular
vowel, in the same gesture.
Here we contrast two vowels:
At left is the vowel [3], as in
heard (like one example used in the preceding figure).
At right, [o], as in hot. The top graphs and sound
files are for experimental measurements of the vocal
fold contact. Note that this measurement of the source
shows little difference between the two vowels: the
filter has little or no effect on the source. The next
pair of graphs are measurements of the vocal tract, made
from the mouth, during the vowel. (More on this
technique here.)
The broad peaks identify resonances of the vocal tract,
the sharp lines are the harmonics. Here, because the
tract is in a different configuration for the two
vowels, the resonances occur at different frequencies.
The next two rows show the voice output for voiced
speech and for whispering, measured in the same vocal
gesture. More detail on these examples here.
So, to summarise, the spectrum of the output sound
depends on the spectrum of the laryngeal source, on the
frequency dependent gain of the vocal tract, including
the efficiency of radiation from the mouth and nose.
Although we have not yet mentioned it, it also depends on
interactions among these. We shall discuss these in the
more detailed sections below.
Some difficulties
Before we leave this brief overview, it is worth noting
that there is still much about the voice that is still
incompletely understood. One of the reasons for this is
the difficulties of doing experiments. Some of the data
that we should like to know the gain function of the
vocal tract sketched above, the mass and force
distribution in the vocal folds, for instance are
impossible to measure while the voice is operating, not
only ethically but practically.
For most human physiology, much information has been
obtained from other species, whose organs function in
similar ways. When it comes to the voice, however, there
is no such similar species no-one is very interested in
the voice of the lab rat. Much of our knowledge comes from
experiments using just the sound of the voice as
experimental input. Other knowledge comes from medical
imaging. Another approach is to use a mathematical model:
one can treat the vocal folds as collections of masses on
springs, and the vocal tract as an oddly shaped pipe that
transmits sound. The next step is to solve the equations
for this simple system and to predict the sound it would
make, and to see how this correlates with sounds of speech
or singing. Another is to make artificial systems with the
shape of the vocal tract and some sort of aero-mechanical
oscillator at the position of the glottis. Yet other
knowledge comes from other experiments and observations
that are often, for practical and ethical reasons,
somewhat indirect. Because of the importance of the human
voice, these are all active research areas.
We now look more closely at some of the topics introduced
above. Other reviews are given by, for example, Lieberman
and Blumenstein, 1988; Titze, 1994; Stevens, 1999;
Hardcastle and Laver, 1999; Johnson, 2003; Clark et al.,
2007; Wolfe el al, 2009; Zhang, 2016; Garnier et al, 2020.
References are given below.
The source at the larynx
To speak or to sing, we usually expel air from the
lungs. The air passes between the vocal folds, which are
muscular tissues in the larynx. If we get the air pressure
and the tension and position of the vocal folds just
right, the folds vibrate at acoustic frequencies. This
means we have an oscillating valve, letting puffs of air
flow into the vocal tract at some frequency fo.
These sketches illustrate
the larynx, viewed from above, in position for phonation
and for breathing.
Technically, we move the arytenoid cartillages closer
than their separated breathing position, which brings the
vocal folds closer to each other, called adduction
(Scherer 1991). This reduced aperture between the folds is
called the glottis. Compared to the breathing
position, the narrow glottis restricts the flow of air,
which in turn means that the steady pressure drop across
the larynx is greater when the aperture is small. The
higher pressure drop means that the speed of air through
the glottis is high, but the small cross section means
that the volume flow (in litres per second) is less.
Experiment:
take a deep breath and time how quickly you can breathe it
out completely with your larynx relaxed. Now do the same,
while pronouncing a whispered ah and singing a (loud)
ah. Which breath lasts longest (i.e. which has the
lowest flow)? If you have gone from a large inhalation to
maximum exhalation, you have probably expelled about five
litres. Divide that by the time taken to get the flow
rate. (More about flow rate and air speed here.)
The schematic at right sketches the vocal folds
in cross section, seen from in front. In (a), the
pressure acting below the vocal folds tends to
force them upwards and apart. Ths pressure
difference is also responsible for accelerating
air through the glottis to produce the high-speed
air flow: the blue arrow in (b).
The rapid air flow through the glottis creates a
suction (black arrows). This suction tends to draw
the folds back together (via the Bernoulli
effect, Van den Berg 1957), as does the
tension in the folds themselves. These alternating
effects tend to excite a cycle of closing and
opening of the folds, which is assisted by the
inherent springiness or elasticity of the folds
(which provide a restoring force), their mass
(which provides inertia) and the inertia of the
air flow itself, which maintains high flow rates
even as the folds are closing.
The fold motion can involve a surface wave
(Titze, 1988) in which the phase of the lateral
motion of the surface of the folds varies as a
function of height. Under suitable conditions,
this can extract energy from the DC flow and
convert it into oscillatory energy. Another effect
that converts energy for auto-oscillation is the
upwards sweeping motion of the folds (Boutin et
al., 2015), whose phase leads that of the
horizontal motion (George et al, 2008).
Depending on the acoustic
impedance of the tracts above and below the
folds, and on passive mechanical and geometric
properties of the tissue, this can lead to
self-sustained oscillation (Helmholtz, 1877; van
den Berg, 1958; Flanagan and Landgraf, 1968;
Fletcher, 1979; Awrejcewicz, 1990; Fletcher, 1993;
Elliot and Bowsher, 1982; Titze, 1988; Titze,
1994; Adachi and Yu, 2005; Boutin et al., 2015).
Muscles do not directly vibrate the vocal folds,
which is a passive effect described above (Van den
Berg, 1958). However, muscles contribute to its
control, by determining how much the folds are
pushed together and how much they are stretched.
If you get these parameters right, and hold them
steady, you can produce a note with a fixed pitch,
which means that the folds are vibrating in a
regular, periodic way. Thats what we (usually) do
in singing. In normal speech, the pitch varies
during each syllable, usually in a smooth way.
The fundamental frequency for speech ( fo)
is typically 100 to 400 Hz. For singing, the
range may be from about 60 Hz to over
1500 Hz, depending on the type of voice. The
speed of sound c is about 340 m.s−1, so
the wavelengths of the fundamentals
(λ = c/f) are roughly 1 to
3 metres, but can be as short as 0.3 m
or less for high-pitched singing. So here is an
important point: the wavelength is usually, but
not always, rather longer than the vocal tract
itself, which is typically 0.15-0.20 m
from mouth to glottis. This makes the voice very
different from typical musical instruments: a man
whose vocal tract is the size of piccolo still
manages to produce a vocal range corresponding to
that of a trombone or bassoon. In the artificial
instruments, a resonance of the bore largely
controls the pitch. For the voice, the resonances
of the vocal tract rarely control the pitch.
Rather, the pressure provided by the lungs and the
tension in the vocal folds together largely
determine both loudness and pitch, but resonances
in the vocal tract can make a big difference too,
as we'll see below.
|
Different registers and vocal mechanisms
How to cover a wide range of pitch? Lets compare with
musical instruments (See Standing
Waves for a discussion). On a violin or guitar, one
can change the length of a string, but to cover a large
range, one can also cross to a new string. In trumpet,
trombones, clarinets, flutes etc, one can change the
length of a pipe (with valves, a slide or keys) but one
can also change registers, which means changing the mode
of vibration in the pipe.
In the voice, we can change the muscle tension and the
pressure to vary the pitch. However, to cover a range of a
few octaves, we usually need to use different registers
(Garcia, 1855). The distinctions among registers in
singing are not always clear, however, because changing
registers corresponds to both laryngeal and vocal tract
adjustments (Miller, 2000). The vocal folds can vibrate in
(at least) four different ways, called mechanisms (Roubeau
et al., 2004; Henrich, 2006).
- Mechanism 0 (M0) is also called creak or
vocal fry. Here the tension of the folds is so low
that the vibration is not periodic (meaning that
successive vibrations have substantially different
lengths). M0 sounds low but has no clear pitch (Hollien
and Michel, 1968). Experiment:
if you hum softly the lowest note you can and then go
lower, you will probably produce M0.
- Mechanism 1 (M1) is usually associated with
what women singers call the chest register and men
call their normal voice ('modal' voice for singers).
This is used to produce low and medium pitches. In M1,
virtually all of the mass and length of the vocal folds
vibrates (Behnke, 1880) and frequency is regulated by
muscular tension (Hirano et al., 1970) but is also
affected by air pressure. The glottis opens for a
relatively short fraction of a vibration period (Henrich
et al., 2005).
- Mechanism 2 (M2) is associated with the head
register of women and thefalsetto register in men. It
is used to produce medium and high pitches for women,
and high frequencies for men. In M2, a reduced fraction
of the vocal fold mass vibrates. The moving section
involves about two thirds of their length, but less of
the breadth. The glottis is open for a longer fraction
of the vibration period (Henrich et al., 2005).
- Mechanism 3 (M3) is sometimes used to describe
the production of the highest range of pitches, known as
the whistle or flageolet register (not to be
confused with whistling) (Miller and Shutte, 1993).
Relatively little has been published on this. (We have
been researching it lately. Garnier et al, 2010; 2012.)
Although some people use M0 in speech, especially at the
end of sentences, and coloratura sopranos use M3 in their
highest range, speech and singing usually use M1 and M2.
Men and women typically change from M1 to M2 at about
350-370 Hz (F4-F#4) (Sundberg, 1987), which is often
called a 'break' in the voice. Consequently, with their
lower overall range, men typically use M1 for nearly all
speech and most singing. However, in some styles of pop
music and some operatic styles, men use M2 extensively:
men who sing alto are usually using M2. For women singers,
the situation depends on vocal range. Sopranos sing in M2
and usually extend its range downwards to avoid the
'break' over their working range. High sopranos may use
M3. Altos often use both M1 and M2.
There is usually a pitch and intensity range over which
singers can use either M1 or M2 (Roubeau et al., 2004),
and trained singers are good at disguising the transition.
Sometimes, as in yodeling, the transition is a feature. Experiment:
if you try to produce a smooth pitch change or glissando
over your whole range, you will probably notice a
discontinuity: a jump in pitch and a change in timbre at a
pitch somewhere near the bottom of the treble clef. This
is where you change from M1 to M2. At the pitch of that
break, you may also produce a break by singing a crescendo
or decrescendo at constant pitch (see Svec et al., 1999;
Henrich, 2006).
Sopranos often have a considerable overlap region for M2
and M3. Those that sing with M3 also use a different form
of resonance tuning in the high range. This gives a
complicated set of possible strategies for singing in the
high range (Garnier et al, 2010; 2012).
The next figure shows a spectrogram of a glissando
through the four mechanisms.
A spectrogram plots
frequency (vertical) against time (horizontal) with
sound level in colour or grey-scale. This one shows the
four laryngeal mechanisms on an ascending glissando sung
by a soprano. Notice the discontinuities in frequency
(clearer in the higher harmonics) at the boundaries
M1-M2 and M2-M3. The horizontal axis is time, dark
represents high power, and the horizontal bands in the
broad band M0 section clearly show four broad peaks in
the spectral envelope. These may also be seen to varying
degrees in the subsequent harmonic sections. This
glissando in wav. Spectrogram
above in .jpg.
Producing a sound
The processes that convert the DC or steady pressure
in the lungs into AC or oscillatory air flow and vocal
fold vibration involve nonlinear effects. First, the
Bernoulli suction between the folds is proportional to
the square of the flow velocity (see this
link). Second, the collision of the folds when the
glottis closes is also highly nonlinear (Van den Berg,
1957; Flanagan and Landgraf, 1968; Elliot and Bowsher,
1982; Fletcher, 1993).
In science, linear just means that the
equation is a straight line, so a change in one variable
produces a proportional change in the other. We show elsewhere
that an oscillator with a linear force law vibrates in a
pure sine wave, which has just one spectral component.
Conversely, anything with a nonlinear force law does not
vibrate sinusoidally, and so has more than one frequency
component. For some non-scientists, linear and nonlinear
have been confused by postmodernism, where the words are
used metaphorically.
Because of nonlinearities, the fold vibration is
nonsinusoidal and therefore has many frequency components.
In M1, M2 and M3, the motion is (almost exactly) periodic,
so the spectral components are harmonic: a microphone or
flow meter placed at any point in the tract would indicate
components at the fundamental frequency f0
and its harmonics 2f0, 3f0
etc, as shown in the figures above. (Follow this
link for harmonic spectra.)
Generally, the amplitude of harmonics decreases with
increasing frequency, though there are important
exceptions. The negative slope in the spectral envelope
(called the spectral tilt) is different for types of
speech or singing (Klatt and Klatt, 1990). To some extent,
this slope is compensated by the response of the human
ear, which is usually more sensitive to the higher
harmonics than to the fundamental (see Hearing).
More power in the high harmonics makes a sound bright and
clear; weakening the high harmonics makes a mellow, darker
or muffled sound. If you have a sound system with bass and
treble or tone controls, or a sound editing program, you
can experiment
with strengthening and weakening the high harmonics using
the treble or tone control. (Some filtered voice sound
examples are here.)
A breathy voice has a spectrum with a strongly negative
slope. This voice is produced when the glottis doesnt
close completely. The spectral envelope is flatter (the
higher harmonics are less weak) in loud speech or singing,
which have an abrupt closure of the vocal folds and a
short open phase of the glottis (Childers and Lee, 1991;
Gauffin and Sundberg, 1989, Novak and Vokral, 1995). This
flatter spectrum has relatively more power in the
frequency range 14 kHz, to which the ear is most
sensitive.
It is possible to make high-speed video images of the
vocal folds using an optical device (endoscope) inserted
in either the mouth or nose (Baken and Orlikoff, 2000;
Svec and Schutte, 1996). Electroglottography (Childers and
Krishnamurthy, 1985), which is described above, is less
invasive but less direct. Although the flow through the
glottis cannot be measured, it can be estimated from the
flow from the mouth and nose, which can be measured using
a face mask (Rothenberg, 1973) or from the sound radiated
from the mouth. Both techniques require inverse filtering
(Miller 1959), which in turn requires knowledge of or
assumptions about the acoustic effects of the vocal tract.
When is the source independent of the filter?
As explained above, one cannot do the direct experiments
that would allow us to answer this question directly, so
we are obliged to rely on indirect evidence, or on
theoretical or numerical models.
In a simple
model, Fletcher (1993) uses the 'Bernoulli'
nonlinearity in a simple but general analysis of
resonator-valve interaction with different valve
geometries. He derives equations and inequalities relating
the natural frequencies of the valve, the resonance
frequency of the filter (or resonator) and the fundamental
frequency of the sound produced. Treating the vocal folds
(or a trombonists lips) as a valve that opens when the
upstream pressure excess is increased, this model gives
results consistent with some of what we know about the
voice and trombones: when the resonance falls at a
frequency slightly above that of the valve, a sufficiently
strong resonance can control the oscillation regime. If
the resonances are at much higher frequencies, they have
little influence on the fundamental frequency at which the
valve vibrates. (Our contribution has been quantification
of the work done by the longitudinal sweeping flow of the
vocal folds, Boutin et al., 2015.)
Resonances, spectral peaks, formants, phonemes and
timbre
Acoustic resonances in the vocal tract can produce peaks
in the spectral envelope of the output sound. In speech
science, the word formant is used to describe either the
spectral peak or the resonance that gives rise to it. In
acoustics, it usually means the peak in the spectral
envelope, which is the meaning on this site. We discuss
the different uses in more detail on What
is a formant?, but for the moment note that
formant should be used with care.
Phoneme
In non-tonal languages such as English, vowels are
perceived largely according to the values of the formants
F1 and F2 in the sound (Peterson and Barney; 1952, Nearey,
1989; Carlson et al., 1970), as we've seen above. F3 has a
smaller role in vowel identification. F4 and F5 affect the
timbre of the voice, but have little influence on which
vowel is heard (Sundberg, 1970). We repeat below the plots
of (F2,F1) for two accents of English. Note that, in these
graphs, the axes do not point in the traditional Cartesian
direction: instead, the origin is beyond the top right
corner. The reason is historical: phoneticians have long
plotted jaw height on the y axis and fronting', the place
of tongue constriction, on the x. This choice maintains
that tradition approximately.
These maps were obtained in a web experiment, in which
listeners judge what vowel has been produced in synthetic
words (Ghonim et al., 2007, 2013) in which F1, F2 and F3
are varied, as well as the vowel length and the pitch of
the voice.

We repeat the figure showing
the vowel planes for US and Australian English measured
in an on-line
survey (Ghonim et al., 2007).
The vocal tract as a pipe or duct
To understand how the resonances work in the voice, we
can picture the vocal tract (from the glottis to the
mouth) as a tube or acoustical waveguide. It has
approximately constant length, typically 0.15-0.20 m long,
a bit shorter for women and children. However, the cross
section along the length varies in ways that can be varied
by the geometry of the tongue, mouth etc. The frequencies
of the resonances depend upon the shape. The frequencies
of the first, second and ith resonances are
called R1, R2, ..Ri.., and those of the spectral peaks
produced by these resonances are called F1, F2, ..Fi...
(See this
link for a discussion of the terminology.)
When pronouncing vowels, R1 takes values typically
between 200 Hz (small mouth opening) to 800 Hz. Increasing
the mouth opening gives a large proportional increase in
R1. Opening the mouth also affects R2, but this resonance
is more strongly affected by the place at which the tongue
most constricts the tract. Typical values of R2 for speech
are from about 800 to 2000 Hz. The resonant frequencies
can also be changed by rounding and spreading the lips or
by raising or lowering the larynx (Sundberg, 1970; Fant,
1960).
Well return to discuss this below, but for the moment,
lets note that, if the open end of a tube is widened, the
resonant frequencies rise, which explains the mouth
effect. Similarly, reducing or enlarging the cross section
near a pressure node respectively lowers or raises the
resonance frequency. Conversely, reducing or enlarging the
cross section near a pressure anti-node respectively
raises or lowers the resonance frequency. This explains
some features of the tongue constriction. The nasal tract
has its own resonances, and the nasal (nose) and buccal
(mouth) tracts together have different resonances. The
lowering the velum or soft palate couples the two, which
affects the spectral envelope of the output sound (Feng
and Castelli, 1996; Chen, 1997).
Nasal vowels or consonants are produced by lowering the
velum (or soft palate, see Figure
1). The nasal tract also exhibits resonances.
Coupling the nasal to the oral cavity not only modifies
the frequency and amplitude of the oral resonances, but
also adds further resonances. The interaction can produce
minima or pole-zeros of the vocal tract transfer function,
with resultant minima or holes in the spectrum of the
output sound (Feng and Castelli, 1996; Chen, 1997).
Resonances, frequency, pitch and hearing
Some comments about frequency and hearing are
appropriate here. The voice pitch we perceive depends
largely on the spacing between adjacent harmonics,
especially those harmonics with frequencies of several
hundred Hz (Goldstein, 1973). For a periodic phonation,
the harmonic spacing equals the fundamental frequency of
the fold vibration, but the fundamental itself is not
needed for pitch recognition.
Except for high voices, the fundamental usually falls
below any of the resonances, and so may be weaker than one
of the other harmonics. However, its presence is not
needed to convey either phonemic information or prosody in
speech. The pass band of hard-wire telephones is typically
about 300 to 4000 Hz, so the fundamental is usually absent
or much attenuated. The loss of information carried by
frequencies above 4000 Hz (e.g. the confusion of f and
s when spelling a name) is noticed in telephone
conversation, but the loss of low frequencies is much less
important. (An experiment:
next time you are put on hold on the telephone, listen
to the bass instruments in the music. Their fundamental
frequencies are not carried by the telephone line. Can you
hear their pitch? Of course, they are less 'bassy' than if
you heard them live, but is the pitch any different?)
Our hearing is most sensitive for frequencies from 1000
to 4000 Hz. Consequently, the fundamentals of low voices,
especially low men's voices, contribute little to their
loudness, which depends more on the power carried by
harmonics that fall near resonances and especially those
that fall in the range of high aural sensitivity. (Another
experiment:
you can test your own hearing sensitivity on this
site.)
Timbre and singing
Varying the spectral envelope of the voice is part of
the training for many singers. They may wish to enhance
the energy in some frequency ranges, either to produce a
desired sound, to produce a high sound level without a
high energy input, or to produce different qualities of
voice for different effects. Characteristic spectral peaks
or tract resonances have been studied in different singing
styles and techniques (Stone et al., 2003; Sundberg et
al., 1993; Bloothooft and Pomp, 1986a; Hertegard et al.,
1990; Steinhauer et al., 1992; Ekholm et al., 1998; Titze,
2001; Vurma and Ross, 2002; Titze et al., 2003; Bjorkner,
2006; Garnier et al., 2007b; Henrich et al., 2007). In
this laboratory, we have been especially interested in
three techniques: resonance
tuning, harmonic
singing and the singers formant.
The origin of vocal tract resonances
Vocal tract resonances (Ri) give rise to peaks
in the output spectrum (Fi). However, the relation
between Ri and Fi is a little subtle. For that
reason, lets consider the behaviour of some
geometrically simple systems, for which acoustical
properties can be more easily calculated,
illustrated in the cartoons here and below. (This
section follows Wolfe et al, 2009.)
In the top sketch, we have straightened out the
vocal tract. Below, it is modelled as a simple
cylindrical pipe to explain, only qualitatively,
the origin of the first two resonances. Below we
give theoretical calculations for the input
impedance spectrum and a transfer function for
simplistic models of the vocal tract with
length = 170 mm and
radius = 15 mm. The dashed line is
for a cylinder. When a circular glottal
constriction is added, with a radius of 2 mm
and an effective length of 3 mm (including end
effects), the result is the solid line. This
figure is taken from Wolfe et al (2009).
|
At this stage, it is helpful to introduce the
acoustic impedance, Z, which is the ratio of
sound pressure p to the oscillating component of the
flow, U. (This
link gives an introduction to acoustic
impedance.) Z is large if a large variation
in pressure is required to move air, and conversely.
Z is a complex quantity, meaning that p
and U are not necessarily in phase, so that
Z has both a magnitude (shown in the plots at
right) and a phase. The in-phase component (the real
component when complex notation is used) represents
conversion of sound energy into heat. Components
that are 90° out of phase (imaginary components in
complex notation) represent storage of energy. A
small mass of air in a sound wave can store kinetic
energy but, because of its inertia, pressure is
required to accelerate it. It has an inertive
impedance (p is 90° ahead of U,
positive imaginary component). Flow of air into a
small confined space increases the pressure, storing
potential energy. This is compliant
impedance (p is 90° behind U,
negative imaginary component). When the dimensions
of a duct are not negligible in comparison with the
wavelength, p and U vary along its
length. Z often varies strongly with
frequency and the phase changes sign at each
resonance. |
Tract-wave interactions
Now the mouth is open to the outside world, but the
sound wave is not completely free to escape, because of
Zrad, the impedance
of the radiation field outside the mouth. A pressure
p at the lips is required to accelerate a small mass of
air just outside the mouth, so the inertance is not zero,
but is usually Zrad small. At high
frequency, however, larger accelerations are required for
any given amplitude, so Zrad increases
with frequency. In a confined space (inside the vocal
tract), acoustic flow does not spread out, so impedances
are usually rather higher than Zrad.
As we explain in this
link, Z in a pipe (or in the vocal tract)
depends strongly on reflections that occur at open or
closed ends. A strong reflection occurs at the lips, going
from generally high Z inside to low Z in
the radiation field. Suppose that a pulse of high-pressure
air is emitted from the glottis just when a high pressure
burst pulse returns from a previous reflection: the
pressures add and Z is high. Conversely, if a reflected
pulse of suction cancels the input pressure excess, Z
is small. This effect produces the large range of Z
shown in the previous figure. High output levels occur at
the lips when the input impedance Z is a minimum.
For the sake of simplicity, lets imagine the tract as a
tube, nearly closed at the glottis but open at the mouth.
In fact, for /3/ (the vowel in the word "heard"), the
resonances shown in the figure above fall at the
frequencies expected for an open cylindrical tube of
length 170 mm, open at the mouth and nearly closed at the
glottis. Now, for a simple tube with length L, open at the
far end, the behaviour is shown by the dashed line in the
preceding figure. The wavelengths that give maxima in Z
are approximately λ1 = 4L, λ3 = 4L/3,
λ5 = 4L/5, etc and so at frequencies,
f1 = c/4L, f3 = 3c/4L = 3f1,
f5 = 5c/4L = 5f1,
etc. Minima occur half way between the maxima.
Now lets add the glottis, giving a local constriction at
the input. The solid line shows the new input impedance Z.
The maxima in Z (pressure antinodes or flow nodes)
are hardly changed. This makes sense: a local constriction
(of small volume) at the input has little effect on a
maximum in Z, where flow is small. For modes where
the flow is large, however, the air in the glottis must be
accelerated by pressures acting on only a small area. So
the frequencies of the minima in Z (pressure node,
flow antinode) fall at lower frequencies. If the glottis
is sufficiently small, Z(f) falls abruptly
from each maximum to the next minimum, which thus occur at
similar frequencies. So do the maxima in the transfer
functions.
So far, we havent mentioned the impedance of the
subglottal tract leading to the lungs. This is difficult
to measure. However, there are good reasons to expect no
strong resonances in the audio frequency range. The lungs
have complicated geometry, with successively branching
tubes, extending to quite small scale at the alveoli. This
is expected to produce little reflection in the range of
frequencies that interest us (see Fletcher et al., 2006).
As mentioned above, many of the obvious experiments for
studying vocal tract resonances are impossible. A number
of less obvious techniques exist, however. One of our
papers reviews these (Wolfe et al, 2009).
Tract-wave interactions: Do the source and the
filter affect each other?
As we explained above, the resonances of the vocal tract
occur at frequencies well above those of the fundamental
frequency at least for normal speech and low singing.
Further, the frequencies of vocal fold vibration (which
gives the voice its pitch) and those of the tract
resonances (which determine the timbre and, as we have
seen, the phonemes) are controlled in ways that are often
nearly independent. In most singing styles, the words and
melody of a song are prescribed. Conversely, in speech, we
have the subjective impression that we can vary the
prosody independently of the phoneme for example, one
can often replace a key word in a sentence without
changing the prosody at all.
As mentioned above, the voice is unlike a trombone or
other wind instrument*, in which one of the resonances of
the air column drives the player's lips or reed
(respectively) at a frequency close to its resonant
frequency. In the voice, there is usually no simple
relation between the frequencies: a singer may cover a
range of two or more octaves (i.e. vary the frequency by a
factor of 4 or more) with relatively little change in the
shape and size of the vocal tract. Further, although there
is typically a difference of an octave (a factor of two in
wavelength) between the fundamental frequencies of male
and female singing voices, there is much smaller
difference in the lengths of the tracts.
From this we can conclude that the resonances of the
tract do not normally control the pitch frequency of the
voice. Nevertheless, the glottal source and the vocal
tract resonances may be interrelated in a number of ways.
First, there are direct, physical interactions: the mode
of phonation affects the reflections of sound waves at the
glottis, and so affects standing waves in the tract (cf.
Fig. 4). Second, pressure waves in the tract can influence
the air flow through the glottis or the motion of the
vocal folds. Third, there is the possibility that speakers
and singers may consciously or unconsciously use
combinations of fundamental frequency and resonance
frequency for different effects, in particular to improve
their efficiency. We discuss these in turn.
* Is there an acoustic instrument like
the voice? Not really, but one can mention some
similarities with the harmonica or mouth organ. In that
instrument, the pitch is largely determined by
mechanical properties of a metal reed that controls the
air flow. The pitch may, however be affected by effects
in the acoustic field nearby, e.g. cupping the hands
over the instrument to bend tones. Like the voice, the
harmonica may produce sounds whose wavelengths are much
larger than the size of the instrument and, like the
voice, one can modify the spectral envelope by changing
the geometry of the air space through which it radiates.
Does the glottis affect the tract resonances?
The glottis is very much smaller that the cross-section
of the vocal tract, which is why, in the simplistic figure
above, we treated the vocal tract as a pipe open at mouth
and closed at glottis. This is an exaggeration, of course!
The average opening of the glottis depends on what
fraction of the time it is open (its open quotient) and
how far it opens (Klatt and Klatt, 1990; Alku and Vilkman,
1996; Gauffin and Sundberg, 1989), which in turn depend on
the voice register and pitch.
For a duct that is almost closed at one end and open at
the other, the frequency of the first resonance increases
as the opening increases. Various researchers have shown
that, when the glottis is somewhat open for whispering,
the resonance or formant peaks occur at higher frequencies
(Kallail and Emanuel, 1984a,b; Matsuda and Kasuya, 1999;
Itoh et al., 2002; Barney et al, 2007; Swerdlin et al.,
2010).
Do pressure waves affect the vocal fold vibration?
This is an area in which its hard to do the experiments
that would most clearly answer the question. However,
there has been a lot of work on numerical models. Some of
these predict that air through the glottis and the vocal
fold vibrations depend on the pressure difference across
the glottis and folds, and thus waves in the tract
(Rothenberg, 1981; Titze, 1988, 2004). Not surprisingly,
the phase of the pressure wave is important in these
models: whether a pressure decrease outside the vocal
folds will tend to open them will depend on when during a
cycle it arrives.
Can one observe the effect of pressure waves on the
motion of vocal folds experimentally? Hertegard et al.,
(2003) used an endoscope (camera looking down the throat)
to film the larynx while singers mimed singing, and a tube
sealed at the lips provided artificial pressure waves.
They reported bigger vibrations when the pressure waves
had frequencies near those of normal singing. In our lab
(Wolfe and Smith, 2008), we used electroglottography (EGG,
described above) to monitor the vocal fold vibration, and
used a didjeridu to produce the pressure waves. We found
that the didjeridu signal could drive the folds at a level
comparable with those generated by singing. All the above
evidence suggests that the standing waves in the filter
have a strong interaction with the source.
Do singers and speakers use tract resonances and pitch
in a coordinated way?
If you want to sing or to speak loudly, you might want
to take advantage of the resonances of the vocal tract to
improve the efficiency with which energy is transmitted
from the glottis to the outside sound field. The most
studied example is the problem faced by sopranos. The
range of R1 (about 300 to 800 Hz, roughly D4 to G5)
overlaps approximately the range the soprano voice. If a
soprano did nothing about this, shed have a serious
problem: First, for many note-vowel combinations, the f0
of the note would fall above R1. So she would lose the
power boost from R1. This is particularly a difficulty for
opera singers, who must compete with an orchestra, without
the aid of a microphone. There is also the problem that
her voice quality would tend to change when she crossed
the R1 = f0 line.
Sundberg and colleagues pointed out that, in classical
training, sopranos learn to increase the mouth opening as
they ascend the scale (Lindblom and Sundberg, 1971,
Sundberg and Skoog, 1997) and measured this opening as a
function of pitch. They deduced that they were tuning R1
to a value near f0.
Our experiments, using acoustic excitation at the mouth,
confirmed this (Joliveau et al., 2004a,b). When f0
was low enough, sopranos used typical values of R1 and R2
for each vowel. However, when f0 was equal or greater to
the usual value of R1, they increased R1 so that it was
usually slightly higher than f0. For vowels with low R1,
this tuning of R1 to f0 starts at lower pitch, and it
continues almost up to 1 kHz. Here is a web page about
this research, including some sound files.
We dont know exactly how they learn to do this: it might
be that they respond, probably subsciously, when the sound
is louder for a given effort. Or it may be that vibrations
are easier to produce when the resonance is appropriately
tuned. Either way, they could learn to reproduce this
effect. A simple model shows that outward opening valves
(those that open in the direction of the steady flow) tend
to be driven most easily at frequencies a little below the
resonance: the model valves drive inertive loads better
than compliant ones (Fletcher, 1993).
What about other singers? In much of the alto range, and
for some vowels in the high range of mens voices, the
same problem arises and, although it is much less studied,
similar effects are occasionally, but not universally
observed (Henrich et al, 2011). Further, some singers seem
to tune R1 to the second harmonic (i.e. to 2f0)
over a limited range (Smith et al., 2007). In another
study, a practitioner of a very loud Bulgarian womens
singing style, was found to tune R1 to 2f0
(Henrich et al., 2007).
Finally, it is worth noting that it is difficult to tune
R1 much above 1 kHz, in part because it is hard to
open one's mouth wide enough. Some sopranos who practise
the very range of the coloratura soprano, or the whistle
voice in pop music, tune R2 to f0,
above about C6, which gives them up to another octave or
so in their whistle or M3 mechanism (Garnier et al, 2011).
This figure, from Kob et al (2011), shows the different
tuning strategies that may be used by different voice
categories. Oversimplifying for the sake of brevite, low
voices may tune R1 (or R2) to harmonics of the voice.
Altos, especially in belting and in the Bulgarian style,
tune R1 to the second harmonic. Sopranos tune R1 to f0 up
to high C and above that tune R2 to f0. Men sometimes tune
R1 to high harmonics of the voice. See Henrich et al
(2011) for details.
Harmonic singing
In a range of styles known as harmonic or overtone
singing, practitioners use a constant, rather low
fundamental frequency (in a range where the ear is not
very sensitive). They then tune a resonance to select one
of the high harmonics, typically from the about the fifth
to the twelfth (Kob, 2003; Smith et al., 2007). We have a
web
site about this.
Is resonance tuning used in speech?
Some speakers (actors, public speakers, teachers) have
to speak long and loud. Resonance tuning might be easier
for them in one sense: unlike (most) singers, they get to
choose the pitch for every word. Some preliminary research
suggests that resonance tuning is used in shouting
(Garnier et al., submitted).
The singers formant
Male, classically-trained singers often show a spectral
peak in the range 24 kHz, a range where the ear is
quite sensitive. This spectral peak is called the singers
formant (Sundberg, 1974, 2001; Bloothooft and Plomp,
1986b). This vocal feature has the further advantage that
orchestras have relatively little power in this range,
which might allow opera soloists to project or to be
heard above a large orchestra in a large opera hall.
Singers formants are either weaker, not usually observed,
or harder to demonstrate, in women singers (Weiss et al.,
2001). This is not surprising: high voices have wide
harmonic spacing, which makes it hard to define a formant
in the spectrum of any single note. (While one can find a
peak in the time-average spectrum of many notes, this is
not necessarily the same as a formant, because it depends
on which notes are sung.) Further, a resonance in this
range is of less use to a high alto or soprano, because
the wide harmonic spacing allows a resonance to fall
between harmonics. High voices also have the advantage
that the fundamental, usually the strongest harmonic,
falls in the range of sensitive human hearing. If the gain
in a singers vocal tract had a bandwidth of a few hundred
Hz (the typical width of the singers formant), then for
many notes in the high range it would fall between two
adjacent harmonics. Finally, high voices can use resonance
tuning more effectively than other singers, and may
therefore have less need of a singers formant.
Sundberg (1974) attributes the singers formant to a
clustering of the third, fourth and/or fifth resonances of
the tract. (Measuring the resonances associated with it is
an ongoing project in our lab.) Singers produce this
formant by lowering the larynx low and narrowing the vocal
tract just above the glottis (Sundberg, 1974; Imagawa et
al., 1003; Dang and Honda, 1997; Takemoto et al., 2006). A
vocal tract with this geometry should work better to
transmit power from the glottis to the sound field outside
the mouth.
When a strong singers formant is combined with the strong
high harmonics produced by rapid closure of the glottis,
the effect is a very considerable enhancement of output
sound in the range 2-4 kHz i.e. in a range in which
human hearing is very acute and in which orchestras
radiate relatively little power. It is not surprising that
these are among the techniques used by some types of
professional singers who perform without microphones.
Increasing the fraction of power at high frequencies has
a further advantage: at wavelengths long in comparison
with the size of the mouth, the voice radiates almost
isotropically. As the frequency rises and the wavelength
decreases, the voice becomes more directional, and
proportionally more of the power is radiated in the
direction in which the singer faces, which is usually
towards the audience (Flanagan 1960; Katz and
dAlessandro, 2007, Kob and Jers, 1999). So increasing the
power at high rather than low frequencies via rapid
glottal closure and/or a singers formant help the singer
not to waste sound energy radiated up, down, behind and
to the sides.
A number of studies have investigated a speakers formant
or speakers ring in the voice of theatre actors or in the
speaking voice of singers (Pinczower and Oates, 2005;
Bele, 2006; Cleveland et al., 2001; Barrichelo et al.,
2001; Nawka et al., 1997). Leino (1993) observed a
spectrum enhancement in the voices of actors, but of
smaller amplitude than the singing formant, and shifted
about 1kHz towards high frequencies. This was interpreted
as the clustering of F4 and F5. Bele (2006) reported a
lowering of F4 in the speech of professional actors, which
contributed to the clustering of F3 and F4 in an important
peak. Garnier (2007) also reported such a speaker's
formant in speech produced in noisy environment, with a
formant clustering that depended on the vowel.
More about speech and singing
Voice science is a broad and active area of research. The
references quoted in this essay appear below, and below that
is a collection of links.
The essay uses examples from our
research on the voice and our publications
on voice and music acoustics.
This web essay was written by Joe
Wolfe, Maëva
Garnier and John
Smith of the Acoustics
Group at UNSW, 2009.
Parts of it were published in Garnier
et al., 2020.
References
- Adachi, S., & Yu, J. (2005).
Two-dimensional model of vocal fold vibration for
sound synthesis of voice and soprano singing. Journal
of the Acoustical Society of America. 117, 32133224.
- Alku, P., (1991). "Glottal Wave
Analysis With Pitch Synchronous Iterative Adaptive
Inverse Filtering", in Proc. Second European Conf on
Speech Communication and Technology, Genova, Italy.
- Barrichelo, V. M. O., Heuer, R. J.,
Dean, C. M. & Sataloff, R. T. (2001)Comparison of
singer's formant, speaker's ring, and LTA spectrum
among classical singers and untrained normal
speakers, J. Voice, 15, 344-350.
- Baken, R.J. and Orlikoff, R.F.
(2000). Clinical Measurement of Speech and Voice. 2nd
ed. Singular Publishing Group, San Diego, California.
- Barney, A., De Stefano, A., and
Henrich, N. (2007). The effect of glottal opening on
the acoustic response of the vocal tract Acta
Acustica united with Acustica, 93, 1046-1056.
- Behnke E. (1880). The mechanism of
the human voice, 12th ed. London: J. Curwen &
Sons, Warwick Lane, E.C.
- Bele, I. (2006) "The speaker's
formant". J. Voice, 20, 555-578.
- Bjorkner, E. (2006). Why so
different? Doctoral dissertation. KTH, Stockholm.
- Bloothooft, G. and Plomp, R. 1986a.
Spectral analysis of sung vowels. III.
Characteristics of singers and modes of singing. J.
Acoust. Soc. Am. 79, 852-864.
- Bloothooft, G. and Plomp, R. (1986b).
The sound level of the singer's formant in
professional singing. J.Acoust.Soc.Am., 79, 2028-2033.
- Boutin, H., Smith J. and Wolfe, J.
(2015) "Laryngeal flow: how large is the component due
to vertical motion of the vocal folds during the
closed glottis phase?" J. Acoust. Soc. America, 138,
146-149.
- Carlson, R., Granström, B. and Fant,
G. (1970). "Some studies concerning perception of
isolated vowels." STL-QPSR 2-3: 19-35.
- Chen, M.Y. (1997). Acoustic
correlates of English and French nasalized vowels. J.
Acoust. Soc. Am. 102, 2360-2370.
- Childers, D.G., Krishnamurthy A.K.,
(1985). "A critical review of electroglottography".
Critical rev. biomed.l eng., 12, 131-161.
- Childers, D. G. & Lee, C. K.
(1991). Vocal quality factors: analysis, synthesis,
and perception. J.Acoust.Soc.Am., 90, 2394-2410.
- Chu, D.T.W., Li, K., Epps, J., Smith
J. and Wolfe, J. (2013) "Experimental
evaluation of inverse filtering using physical
systems with known glottal flow and tract
characteristics" J.
Acoust. Soc. America. 133,
EL358-362.
- Clark, J. Yallop, C. and Fletcher,
J., An Introduction to Phonetics and Phonology,
Blackwell, Oxford (2007).
- Cleveland, T. F., Sundberg, J. and
Stone, R. E. (2001). Long-term-average spectrum
characteristics of country singers during speaking and
singing. J. Voice, 15, 54-60.
- Dang, J. and Honda, K., (1997).
"Acoustic characteristics of the piriform fossa in
models and humans", J.Acoust.Soc.Am., 101: 456-465.
- Ekholm, E., Papagiannis, G. C. and
Chagnon, F. P. 1998. Relating objective measurements
to expert evaluation of voice quality in western
classical singing: Critical perceptual parameters. J.
Voice 12, 182-196.
- Elliot, S.J., Bowsher, J.M. (1982).
"Regeneration in brass wind instruments", J. Sound
& Vibration 83, 181-217.
- Epps, J., Smith, J.R. and Wolfe, J.
(1997) "A
novel instrument to measure acoustic resonances of
the vocal tract during speech" Measurement
Science and Technology 8, 1112-1121.
- Fant, G. (1960). Acoustic Theory of
Speech Production. Mouton & Co, The Hague,
Netherlands.
- Feng, G. and Castelli, E. 1996. Some
acoustic features of nasal and nasalized vowels: A
target for vowel nasalization. J. Acoust. Soc. Am.,
99, 3694-3706.
- Flanagan, J. and Landgraf, L. (1968).
"Self-oscillating source for vocal-tract
synthesizers", IEEE Trans. Audio and Eletroacoustics,
16, 57-64.
- Flanagan, J. L. (1960). Analog
Measurements of Sound Radiation from the Mouth.
J.Acoust.Soc.Am., 32, 1613-1620.
- Fletcher, N.H. "Autonomous
vibration of simple pressure-controlled valves in
gas flows" J. Acoust. Soc. Am. 93: 2172-2180,
1993.
- George, N. A., de Mul, F. F. M., Qiu,
Q., Rakhorst, G., and Schutte, H. K. (2008).
Depth-kymography: high-speed calibrated 3D imaging of
human vocal fold vibration dynamics, Phys. Med. Biol.
53, 26672675.
- Garcia M. (1855). Observations on the
human voice. In: Proc. Royal Soc. London, p. 399-410.
- Garnier, M. (2007). Communication in
noisy environments: from adaptation to vocal
straining. Ph.D thesis, University of Paris 6.
- Garnier, M., Henrich, N.,
Castellengo, M., Sotiropoulos, D. and Dubois, D.
(2007). "Characterisation of Voice Quality in Western
Lyrical Singing: from Teachers's Judgements to
Acoustic Descriptions". J. Interdisciplinary Music
Studies 1(2): 62-91.
- Garnier, M., Henrich, N., Smith, J.
and Wolfe, J. (2010) "Vocal
tract adjustments in the high soprano range" J.
Acoust. Soc. America. 127,
3771-3780.
- Garnier, M., Henrich, N.,
Crevier-Buchman, L., Vincent, C., Smith, J. and Wolfe,
J. (2012) "Glottal
behavior in the high soprano range and the
transition to the whistle registers" J.
Acoust. Soc. America. 131,
951-962.
- Garnier, M., Wolfe, J., Henrich, N.
and Smith, J. (2008). "Interrelationship between vocal
effort and vocal tract acoustics: a pilot study".
Proc. of ICSLP, Brisbane, Australia.
- Garnier, M., Henrich, N., Smith, J.
and Wolfe, J. (2010) "Vocal
tract adjustments in the high soprano range" J.
Acoust. Soc. America. 127,
3771-3780.
- Garnier, M., Henrich, N., Smith, J.
and Wolfe, J. (2020) "The
mechanics and acoustics of the singing voice:
registers, resonances and the source-filter
interaction", in The Routledge Companion to
Interdisciplinary Studies in Singing, Volume I:
Development, Russo, F.A., Ilari, B., Cohen,
A.J. editors. Taylor & Francis.
- Gauffin, J. and Sundberg, J. (1989).
"Spectral correlates of glottal voice source waveform
characteristics." J. Speech and Hearing Research
32(3): 556-565.
- Ghonim, A., Smith, J. and Wolfe, J.
(2007) The
sounds of world English.
- Ghonim, A., Lim, J., Smith, J. and
Wolfe, J. (2013) "Analysis
of an on-line survey of English vowel perception"
Acoustics Australia, 41, 160-164.
- Goldstein, J.L. (1973). "An optimum
processor theory for the central formation of the
pitch of complex tones". J.Acoust.Soc.Am., 54,
1496-1516.
- Hardcastle, W. and Laver, J.D.
(1999). The Handbook of Phonetic Sciences. Blackwell
Handbooks in Linguistics, Wiley-Blackwell.
- Helmholtz, H,L.F. (1877) On the
Sensations of Tone, (trans A.J. Ellis, reprinted
Dover, NY, 1954, 95- 98.
- Henrich, N., Kiek, M., Smith,. J. and
Wolfe, J. (2007) "Resonance
strategies in Bulgarian women's singing",
Logopedics Phoniatrics Vocology, 32, 171-177.
- Henrich, N. (2006). "Mirroring the
voice from Garcia to the present day: some insights
into singing voice registers." Logopedics Phoniatrics
Vocology 31(1): 3-14.
- Henrich, N., d'Alessandro, C., Doval,
B. and Castellengo, M. (2005). "Glottal open quotient
in singing: Measurements and correlation with
laryngeal mechanisms, vocal intensity, and fundamental
frequency." J.Acoust.Soc.Am. 117: 1417-1430.
- Henrich, N., Smith, J. and Wolfe, J.
(2011) "Vocal
tract resonances in singing: Strategies used by
sopranos, altos, tenors, and baritones" J.
Acoust. Soc. America. 129,
1024-1035.
- Hertegard, S., Gauffin, J., Sundberg,
J. 1990. Open and covered singing as studied by means
of fiberoptics, inverse filtering and spectral
analysis. J. Voice, 4, 220-230.
- Hirano M, Vennard W, Ohala J. (1970).
"Regulation of register, pitch and intensity of
voice". Folia Phoniatrica, /22, 1-20.
- Hollien, H. and Michel, J. F. (1968).
"Vocal fry as a phonational register." J. Speech
Hearing Research 11(3): 600-604.
- Itoh, T., Takeda, K. and Itakura, F.
(2002) "Acoustic analysis and recognition of whispered
speech", in Proceedings of ICASSP, vol.1, 389-392.
- Imagawa, H., Sakakibara, K.-I.,
Tayama, N. (2003). "The effect of the hypopharyngeal
and supra-glottic shapes on the singing voice", in
Proceedings of SMAC, Stockholm, Sweden.
- Jeanneteau, M., Hanna, N., Almeida,
A., Smith, J. and Wolfe, J. (2020) "Using
visual feedback to tune the second vocal tract
resonance for singing in the high soprano range",
Logopedics Phoniatrics Vocology, DOI:
10.1080/14015439.2020.1834612
- Joliveau, E., Smith, J. and Wolfe, J.
(2004a) Tuning
of vocal tract resonances by sopranos Nature,
427, 116.
- Joliveau, E., Smith, J. and Wolfe, J.
(2004) "Vocal
tract resonances in singing: the soprano voice",
J.
Acoust. Soc. America, 116,
2434-2439. Copyright (2004) Acoustical Society of
America. This article may be downloaded for personal
use only. Any other use requires prior permission of
the author and the Acoustical Society of America. The
preceeding article appeared in (JASA 116,
2434-2439) and may be found at (J.
Acoust. Soc. America,).
- Johnson, K. (2003). Acoustic and
Auditory Phonetics. 2nd ed.Blackwell, Oxford.
- Kallail, K. J., and Emanuel, F. W.
(1984a). "Formantfrequency differences between
isolated whispered and phonated vowel samples produced
by adult female subjects", J. Speech Hear. Res. 27,
245251.
- Kallail, K. J., and Emanuel, F. W.
(1984b). "An acoustic comparison of isolated whispered
and phonated vowel samples produced by adult male
subjects", J. Phonetics 12, 175-186.
- Katz, B. & D'Alessandro, C.
(2007). Measurement of 3D Phoneme-Specific Radiation
Patterns in Speech and Singing, LIMSI.
http://rs2007.limsi.fr/index.php/PS:Page_14
- Kitamura, T., Honda, K. and Takemoto,
H., (2005). "Individual variation of the
hypopharyngeal cavities and its acoustic effects",
Acoustic Science and Technology, 26(1): 16-26.
- Klatt, D. H. and Klatt, L. C. (1990)
"Analysis, synthesis, and perception of voice quality
variations among female and male
talkers",J.Acoust.Soc.Am., 87, 820-857.
- Kob, M. & Jers, H. (1999).
Directivity measurement of a singer. J.Acoust.Soc.Am.,
105, 1003.
- Kob, M. (2003) Analysis and
modelling of overtone singing in the sygyt style
Appl. Acoust., 65, 1249-1259.
- Kob, M. Henrich, N. Howard, D.,
Herzel, H., Tokuda, I. and Wolfe, J. (2011) "Analysing
and understanding the singing voice: recent progress
and open questions" Current Bioinformatics.
6, 362-374.
- Leino, T. (1993). Long-term average
spectrum study on speaking voice quality in male
actors. Proceedings of SMAC, Stockholm, Sweden,
206-210.
- Lieberman, P., and Blumstein, S.E.
(1988). "Speech physiology, speech perception, and
acoustic phonetics." Cambridge University Press,
Cambridge, UK.
- Lindblom, B. E. F., and Sundberg, J.
E. F. (1971). Acoustical consequences of lip, tongue,
jaw, and larynx movement, J. Acoust. Soc. Am. 50,
1166-1179.
- Matsuda, M. and Kasuya, H.,
(1999)"Acoustic nature of the whisper", in Proceedings
of Eurospeech'99, 133-136.
- Miller, R.L. (1959). "Nature of the
Vocal Cord Wave". J.Acoust.Soc.Am., 31, 6, 667-677.
- Miller, D.G. and Schutte, H.K.
(1993). "Physical definition of the flageolet
register". J. Voice, 7, 3, 206-212.
- Miller DG. (2000). Registers in
singing: empirical and systematic studies in the
theory of the singing voice. Doctoral dissertation,
University of Groningen.
- Nawka, T., Anders, L. C., Cebulla, M.
& Zurakowski, D. (1997). The speaker's formant in
male voices, J. Voice, 11, 422-428.
- Nearey, T. (1989). "Static, dynamic,
and relational properties in vowel perception.".
J.Acoust.Soc.Am.. 85, pp. 2088-2113.
- Novak, A. and Vokral, J. (1995).
"Acoustic parameters for the evaluation of voice of
future voice professionals." Folia Phoniatrica
Logopedica 47: 279-285.
- Petersen, G.E., and Barney, H.L.,
Control methods used in a study of vowels, J.
Acoust. Soc. Am. 24, 175-184 (1952).
- Pinczower, R., Oates, J. (2005)
Vocal Projection in Actors: The Long-Term Average
Spectral Features That Distinguish Comfortable Acting
Voice From Voicing With Maximal Projection in Male
Actors, J. Voice 19, 440-453.
- Rothenberg, M. An interactive model
for the voice source Quarterly Prog. Status Report,
Dept Speech, Music and Hearing, KTH, Stockholm, 22.
1-17 (1981).
- Rothenberg, M. (1973). "A new
inverse-filtering technique for deriving the glottal
air flow waveform during voicing". J.Acoust.Soc.Am.,
53, 6, 1632-1645.
- Roubeau B, Castellengo M, Bodin P,
Ragot M. (2004). "Laryngeal registers as shown in the
voice range profile". Folia Phoniatrica Logopaedica,
56, 5, 321-33.
- Scherer, R.C. (1991). "Physiology of
phonation: A review of basic mechanics". Phonosurgery:
Assessment and surgical management of voice. 77-93.
- Smith J., Henrich N., Wolfe J. (2007)
Resonance tuning in singing, 19th International
Congress on Acoustics, Madrid, Spain, Sept. 2007.
- Smits, R., ten Bosch, L., and
Collier, R. (1996). "Evaluation of various sets of
acoustic cues for the perception of prevocalic stop
consonants. I. Perception
experiment".J.Acoust.Soc.Am., 100, 3852-3864.
- Steinhauer, K.M., Rekart, D.M. and
Keaten, J. (1992). Nasality in modal speech and twang
qualities: Physiologic, acoustic, and perceptual
differences, J.Acoust.Soc.Am., 92, p. 2340.
- Stevens, K.N. (1999). Acoustic
Phonetics. MIT Press, Cambridge, MA.
- Stone, R., Cleveland, T., Sundberg,
J., Prokop, J. (2003). Aerodynamic and acoustical
measures of speech, operatic, and broadway vocal
styles in a professional female singer. J. Voice, 17,
283-297.
- Sundberg, J. (1974) Articulatory
interpretation of the singing formant,
J.Acoust.Soc.Am. 55, 838-844.
- Sundberg, J., Gramming, P. and
Lovetri, J. (1993) Comparisons of pharynx, source,
formant, and pressure characteristics in operatic and
musical theatre singing, J. Voice, 7, 301-310.
- Sundberg, J. (2001), Level and
centre frequency of the singers formant, J. Voice
15, 176-186.
- Sundberg, J., and Skoog, J. (1997)
Dependence of jaw opening on pitch and vowel in
singers, J. Voice 11, 301-306.
- Sundberg, J. (1970). "Formant
structure and articulation of spoken and sung vowels."
Folia Phoniatrica (Basel) 22(1): 28-48.
- Svec, J., Schutte, H.K. and Miller,
D.G. (1999). "On pitch jumps between chest and
falsetto registers in voice: Data from living and
excised human larynges". The J.Acoust.Soc.Am., 106, 3,
1523-1531.
- Svec, J., Schutte, H.K. (1996).
"Videokymography: High-speed line scanning of vocal
fold vibration". J. Voice, 10, 2 , 201-205.
- Swerdlin, Y., Smith, J. and Wolfe, J.
(2010) "The
effect of whisper and creak vocal mechanisms on
vocal tract resonances" J.
Acoust. Soc. America. 127,
2590-2598.
- Takemoto, H., Adachi, S., Kitamura,
T., Mokhtari, P., Honda, K. (2006). "Acoustic roles of
the laryngeal cavity in vocal tract resonance",
J.Acoust.Soc.Am., 120: 2228-2238.
- Titze, I.R. (1988) "The physics of
small-amplitude oscillation of the vocal folds"
J.Acoust.Soc.Am., 83, 1536-1552.
- Titze, I. (1994) Priniciples of Voice
Production.
- Titze, I.R. (2001). Acoustic
Interpretation of Resonant Voice, J. Voice 15,
519-528.
- Titze, I.R., Bergan, C.C, Hunter,
E.J. and Story, B. (2003). Source and filter
adjustments affecting the perception of the vocal
qualities twang and yawn. Logopedics Phoniatrics
Vocology 28 : 47 155.
- Van Den Berg, J. (1958).
"Myoelastic-aerodynamic theory of voice production".
The Journal of Speech Language and Hearing Research,
1, 3, 227-244.
- Van Den Berg, J., Zantema, J.T.,
Doornenbal, P. Jr. (1957). On the Air Resistance and
the Bernoulli Effect of the Human Larynx.
J.Acoust.Soc.Am., 29, 5, 626-631.
- Vurma, A., Ross, J. (2002). Where Is
a Singer's Voice if It Is Placed Forward?, J.
Voice, 16, 383-391.
- van den Berg, J. W. (1958).
Myoelastic-aerodynamic theory of voice production. J.
Speech and Hearing Research, 1, 227244a
- Weiss, R., W . Brown, J. and Moris,
J. (2001). "Singer's Formant in Sopranos: Fact or
Fiction?" J. Voice, 15, 457-468.
- Wolfe, J. and Smith, J. (2008)
"Acoustical coupling between lip valves and vocal
folds" Acoust. Australia, 26, 23-27.
- Wolfe, J., Garnier, M. and Smith, J.
(2009) "Vocal
tract resonances in speech, singing and playing
musical instruments" Human Frontier Science
Program Journal, 3, 6-23.
- Zhang, Z. (2016). "Mechanics of human
voice production and control". J.Acoust.Soc.Am., 140:
2614-2635.
]]
Links

is a multimedia introduction to voice science.
Examples of how acoustic measurements at the
(open) lips can identify different vocal tract
configurations.
The measured impedance ratios γ(f) are at right. The
particpant produces seven different gestures. The
schematics on the left are cartoon 1D models of the tract
(not to scale and highly simplified). Notice that, for
inhalation (e), the vocal tract and trachea are connected,
which roughly doubles the length of the former. This means
roughly twice as many resonances in a given frequency
range. Some of the same effect is seen, for low
frequencies, in breathy phonation (d). (Figure from
Jeanneteau et al (2020) above.)
This web essay was written by Joe
Wolfe, Maëva
Garnier and John
Smith of the Acoustics
Group at UNSW.
Parts of it were published in Garnier
et al., 2020.
|