Partners:
Everis, Spain
ETH, Switzerland
UZH, Switzerland
Freiburg, Germany
MA Systems, UK
Bristol, UK
Xiwrite, Italy
Ultrasis, UK
Jaume, Spain
Valencia, Spain
Lanzhou, China
EU-Grant (FP7): 248544
Representing Speech Characteristics
Human speech is strongly influenced by the affective state of the speaker, such as sadness,
happiness, fear, anger, aggression, lack of energy, or drowsiness. An attentive listener
therefore learns a great deal about the affective state of a conversation partner with little
effort, and without the subject being raised explicitly during the conversation. For this reason,
psychiatrists routinely monitor the speaking behaviour and voice sound characteristics of their
patients for diagnostic purposes and as sensitive indicators of clinical change.
Speaking Behaviour and Voice Sound Characteristics
Speech characteristics can be roughly described by four major features: speech flow, loudness,
intonation and intensity of overtones. Speech flow describes the speed at which utterances are
produced as well as the number and duration of temporary breaks in speaking. Loudness reflects
the amount of energy associated with the articulation of utterances and, when regarded as a
time-varying quantity, the speaker's dynamic expressiveness. Intonation is the manner of producing
utterances with respect to rise and fall in pitch, and leads to tonal shifts above and below
the speaker's mean vocal pitch. Overtones are the higher tones which faintly
accompany a fundamental tone and are thus responsible for the tonal diversity of sounds.
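As a rough illustration of loudness regarded as a time-varying quantity, the following Python sketch computes a short-time RMS level in dB for consecutive frames of a recording. The function names, the 10 ms default frame length and the max-minus-min spread used here as a stand-in for "dynamic expressiveness" are assumptions made for illustration only; the text does not specify how these quantities are actually computed.

```python
import math

def frame_levels_db(samples, rate, frame_ms=10):
    """Short-time RMS level in dB for consecutive, non-overlapping frames.

    samples: sequence of floats (e.g. scaled to -1..1), rate: samples per second.
    The frame length is an illustrative assumption, not a value from the text.
    """
    n = max(1, int(rate * frame_ms / 1000))
    levels = []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        rms = math.sqrt(sum(x * x for x in frame) / n)
        # Floor the RMS so that digital silence does not produce log10(0).
        levels.append(20 * math.log10(max(rms, 1e-12)))
    return levels

def level_spread_db(levels_db):
    """Hypothetical summary of dynamics: loudest minus quietest frame, in dB."""
    if not levels_db:
        return 0.0
    return max(levels_db) - min(levels_db)
```

A speaker whose frame levels stay within a narrow band would yield a small spread under this simplified measure, while lively speech would yield a larger one.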
Analysis of the Nonverbal Content of Human Speech
First, the individual speech recordings are screened for intervals without signal. These
intervals are used to determine the thresholds for background noise, allowing for
a certain "guard" zone. Based on these thresholds, the time series are subdivided into pauses and
utterances ("segmentation"), with pauses shorter than 250 ms being skipped. In a
second step, "spectra" are calculated on 1-second epochs by means of a Discrete
Fourier Transform (DFT); pauses are eliminated beforehand so that the spectral analyses run on
"pure" utterances. Finally, we approximate the shape of the F0 distribution curve ("F0" designates
the mean vocal pitch of a speaker) by a 2nd-degree polynomial and use the distance between
the symmetrical -6 dB points as a measure of the "F0-variability" (intonation). The ratio
height/width of the 2nd-degree polynomial serves as a measure of the "F0-narrowness"
(monotony). The frequency resolution of the DFTs is a quartertone over 7 octaves (55-7040 Hz).
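The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the threshold rule (loudest signal-free frame plus a 6 dB guard margin), the 10 ms frames, the reading of "skipped" as "short pauses are absorbed into the surrounding utterance", and all function names are hypothetical. The -6 dB width assumes the parabola has been fitted to the F0 distribution on a dB scale, with `a` as its leading coefficient.

```python
import math

def noise_threshold_db(silent_levels_db, guard_db=6.0):
    """Assumed rule: loudest signal-free frame plus a guard margin in dB."""
    return max(silent_levels_db) + guard_db

def segment(levels_db, threshold_db, frame_ms=10, min_pause_ms=250):
    """Split frame levels into ('pause' | 'utterance', start_ms, end_ms) runs."""
    runs = []  # each run is [is_utterance, first_frame, end_frame_exclusive]
    for i, level in enumerate(levels_db):
        voiced = level > threshold_db
        if runs and runs[-1][0] == voiced:
            runs[-1][2] = i + 1
        else:
            runs.append([voiced, i, i + 1])
    # Pauses shorter than min_pause_ms between two utterances are not counted.
    min_frames = min_pause_ms / frame_ms
    for idx, run in enumerate(runs):
        interior = 0 < idx < len(runs) - 1
        if not run[0] and interior and (run[2] - run[1]) < min_frames:
            run[0] = True
    merged = []
    for voiced, start, end in runs:
        if merged and merged[-1][0] == voiced:
            merged[-1][2] = end
        else:
            merged.append([voiced, start, end])
    return [("utterance" if v else "pause", s * frame_ms, e * frame_ms)
            for v, s, e in merged]

def quartertone_grid(f_low=55.0, octaves=7):
    """Frequencies spaced a quartertone apart (24 steps per octave)."""
    return [f_low * 2 ** (k / 24) for k in range(24 * octaves + 1)]

def width_at_minus_6db(a):
    """Distance between the two points lying 6 dB below the peak of
    y = a*x**2 + b*x + c (y in dB, a < 0); it depends only on a."""
    if a >= 0:
        raise ValueError("parabola must open downwards (a < 0)")
    return 2 * math.sqrt(6 / -a)
```

With 10 ms frames, a 100 ms gap between two utterances is absorbed into a single utterance, whereas a 400 ms gap is kept as a pause. The default grid runs from 55 Hz to 7040 Hz in 168 quartertone steps, matching the 7 octaves stated in the text.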
Voice sound characteristics ("timbre") of a male speaker as quantified through
spectral analyses. Spectral intensities are plotted along the y-axis on log-proportional
scales as a function of frequency (x-axis: 7 octaves covering the frequency range of 64-8192 Hz).
Mean vocal pitch in females lies one octave above that of male speakers.
Depression significantly reduces the dynamic expressiveness of human voices, thus greatly reducing
inter-individual differences. As a direct consequence, the patients' voices become more similar to
each other ("depressive voice"). Voices regain their distinct individuality during recovery.