INSIGHTS INTO SPOKEN LANGUAGE GLEANED FROM
PHONETIC TRANSCRIPTION OF THE SWITCHBOARD CORPUS
Steven Greenberg, Joy Hollenback, Dan Ellis
University of California, Berkeley
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704 USA
ABSTRACT

Models of speech recognition (by both human and machine)
have traditionally assumed the phoneme to serve as the
fundamental unit of phonetic and phonological analysis.
However, phoneme-centric models have failed to provide a
convincing theoretical account of the process by which the
brain extracts meaning from the speech signal and have fared
poorly in automatic recognition of natural, informal speech
(e.g., the Switchboard corpus).

Over the past five months the Switchboard Transcription
Project has phonetically transcribed a portion of the
Switchboard corpus in an effort to better understand the failure
of phoneme-centric models for machine recognition of speech,
as well as to provide a database through which to improve the
performance of recognition systems focused on conversational
dialogs.
Transcription of spoken dialogs illustrates the pitfalls of a
phoneme-based system. Many words are articulated in such a
fashion as to either omit or significantly transform the
phonetic properties of phonemic constituents, thus resulting in
wide variation of word pronunciations. Often, only the barest
hint of a segment is realized phonetically, in spite of good
intelligibility.

Despite this large variability in phonetic realization of words,
the temporal properties of speech segments, both phones and
syllables, appear to conform to regular patterns. This temporal
regularity suggests that much of the linguistic information in
speech may be signaled through variations in amplitude, pitch
and the coarse spectrum, and that such patterns may be useful in
the design of future-generation speech recognition systems.

1. INTRODUCTION

Models of speech perception and recognition focus on the
phone(me) as the basic representational unit from which lexical
units are ultimately derived. Although this representational
model often provides an adequate (if not completely
comprehensive) descriptive basis for the acoustics of carefully
articulated and read speech, it fails to capture many of the
spectro-temporal properties of spontaneous speech typical of
informal spoken dialog.

The Switchboard corpus provides an excellent test-bed with
which to compare the phonetic properties of spontaneous
speech with those characteristic of more formal speaking
situations as encapsulated by the TIMIT, ATIS and Wall Street
Journal corpora. For the latter three corpora, performance by
automatic speech recognition systems typically ranges between
85 and 98% correct. In contrast, material from the Switchboard
corpus is typically recognized with only 40-60% accuracy.

In the Switchboard corpus two individuals discuss a specific
topic, such as summer vacations, professional dress codes, the
international political situation, credit cards and so on for
several minutes over the telephone. The dialog contains a
significant proportion of "filled pauses" (e.g., "um," "uh-huh",
etc.), "misarticulations" (e.g., transpositions of specific
phonetic segments), phonetic and lexical deletions
("University of Nebraska" being pronounced [yuw nix ver six n
dix bclbrae skclkaeq], where the "of" is entirely deleted, and the
final syllable of "University" [dix] delayed until after the
initiation of the nasal consonant in "Nebraska").

Often, only the vaguest hint of the "appropriate" spectral cues
is present in the spectrographic representation. Typically,
formant transitions usually associated with specific segments
(such as liquids or nasals) are either entirely missing or differ
appreciably from the patterns observed in more formally
articulated speech. Such deviations from the "canonical"
phonetic representation pose a significant challenge to current
models of speech recognition.

2. SWITCHBOARD TRANSCRIPTION PROJECT

In order to more fully characterize the phonetics of spontaneous
speech, seventy-two minutes of the Switchboard corpus
(comprising portions of 618 conversations from 750 speakers,
representing both genders, and spanning a wide range of adult
ages and dialectal patterns from American English) were
phonetically transcribed by a group of eight Linguistics
students (7 undergraduates and 1 graduate student), all of whom
had received previous training in phonetic transcription and
general phonetics/phonology at the University of California,
Berkeley. The transcribers were closely supervised by both the
senior author and Professor John Ohala in order to ensure as
accurate and as uniform a transcription of the materials as
possible. Specific transcription issues were discussed at weekly
project meetings, using a 60" BARCO projection screen for
computer display and audio feedback.

The phonetic transcriptions were encoded with a variant of the
Arpabet transcription system used for the TIMIT corpus. This
Proceedings of the International Conference on Spoken Language Processing, Philadelphia, 1996