Dennis agista

Sunday, 26 February 2012

PHONETIC INTRODUCTION


INSIGHTS INTO SPOKEN LANGUAGE GLEANED FROM
PHONETIC TRANSCRIPTION OF THE SWITCHBOARD CORPUS

Steven Greenberg, Joy Hollenback, Dan Ellis

University of California, Berkeley
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704 USA

ABSTRACT

Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal and have fared poorly in automatic recognition of natural, informal speech (e.g., the Switchboard corpus).

Over the past five months the Switchboard Transcription Project has phonetically transcribed a portion of the Switchboard corpus in an effort to better understand the failure of phoneme-centric models for machine recognition of speech, as well as to provide a database through which to improve the performance of recognition systems focused on conversational dialogs.

Transcription of spoken dialogs illustrates the pitfalls of a phoneme-based system. Many words are articulated in such a fashion as to either omit or significantly transform the phonetic properties of phonemic constituents, thus resulting in wide variation of word pronunciations. Often, only the barest hint of a segment is realized phonetically, in spite of good intelligibility.

Despite this large variability in phonetic realization of words, the temporal properties of speech segments, both phones and syllables, appear to conform to regular patterns. This temporal regularity suggests that much of the linguistic information in speech may be signaled through variations in amplitude, pitch and the coarse spectrum, and that such patterns may be useful in the design of future-generation speech recognition systems.

1. INTRODUCTION

Models of speech perception and recognition focus on the phone(me) as the basic representational unit from which lexical units are ultimately derived. Although this representational model often provides an adequate (if not completely comprehensive) descriptive basis for the acoustics of carefully articulated and read speech, it fails to capture many of the spectro-temporal properties of spontaneous speech typical of informal spoken dialog.

The Switchboard corpus provides an excellent test-bed with which to compare the phonetic properties of spontaneous speech with those characteristic of more formal speaking situations as encapsulated by the TIMIT, ATIS and Wall Street Journal corpora. For the latter three corpora, performance by automatic speech recognition systems typically range between 85 and 98% correct. In contrast, material from the Switchboard corpus is typically recognized with only 40-60% accuracy.

In the Switchboard corpus two individuals discuss a specific topic, such as summer vacations, professional dress codes, the international political situation, credit cards and so on for several minutes over the telephone. The dialog contains a significant proportion of "filled pauses" (e.g., "um," "uh-huh", etc.), "misarticulations" (e.g., transpositions of specific phonetic segments), phonetic and lexical deletions ("University of Nebraska" being pronounced [yuw nix ver six n dix bclbrae skclkaeq], where the "of" is entirely deleted, and the final syllable of "University" [dix] delayed till after the initiation of the nasal consonant in "Nebraska").

Often, only the vaguest hint of the "appropriate" spectral cues are present in the spectrographic representation. Typically, formant transitions usually associated with specific segments (such as liquids or nasals) are either entirely missing or differ appreciably from the patterns observed in more formally articulated speech. Such deviations from the "canonical" phonetic representation pose a significant challenge to current models of speech recognition.

2. SWITCHBOARD TRANSCRIPTION PROJECT

In order to more fully characterize the phonetics of spontaneous speech, seventy-two minutes of the Switchboard corpus (comprising portions of 618 conversations from 750 speakers, representing both genders, and spanning a wide range of adult ages and dialectal patterns from American English) were phonetically transcribed by a group of eight Linguistics students (7 undergraduates and 1 graduate student) all of whom had received previous training in phonetic transcription and general phonetics/phonology at the University of California, Berkeley. The transcribers were closely supervised by both the senior author and Professor John Ohala in order to insure as accurate and as uniform a transcription of the materials as possible. Specific transcription issues were discussed at weekly project meetings, using a 60" BARCO projection screen for computer display and audio feedback.

The phonetic transcriptions were encoded with a variant of the Arpabet transcription system used for the TIMIT corpus. This

                 Proceedings of the International  Conference on Spoken Language Processing, Philadelphia, 1996

PHONETICS & PHONOLOGY


Phonetics vs. Phonology
1. Phonetics vs. phonology
Phonetics deals with the production of speech sounds by humans, often without prior knowledge of the language being spoken. Phonology is about patterns of sounds, especially different patterns of sounds in different languages, or within each language, different patterns of sounds in different positions in words etc.
2. Phonology as grammar of phonetic patterns
  • The consonant cluster /st/ is OK at the beginning, middle or end of words in English.
  • At beginnings of words, /str/ is OK in English, but /ftr/ or /ʃtr/ are not (they are ungrammatical).
  • /ʃtr/ is OK in the middle of words, however, e.g. in "ashtray".
  • /ʃtr/ is OK at the beginnings of words in German, though, and /ftr/ is OK word-initially in Russian, but not in English or German.
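The idea of phonology as a grammar of sound patterns can be sketched as a small lookup of grammaticality judgments. The sketch below records only the judgments stated in the list above, reading the cluster in "ashtray" as /ʃtr/. The names `JUDGMENTS` and `is_grammatical` are hypothetical, and any combination the notes do not mention is returned as `None` rather than guessed.

```python
from typing import Optional

# (language, position in word, cluster) -> judgment.
# Only the judgments stated in the notes are listed; this is a toy
# lookup, not a full phonotactic grammar of any of these languages.
JUDGMENTS = {
    ("English", "initial", "st"): True,
    ("English", "medial", "st"): True,
    ("English", "final", "st"): True,
    ("English", "initial", "str"): True,
    ("English", "initial", "ftr"): False,
    ("English", "initial", "ʃtr"): False,
    ("English", "medial", "ʃtr"): True,   # as in "ashtray"
    ("German", "initial", "ʃtr"): True,
    ("German", "initial", "ftr"): False,
    ("Russian", "initial", "ftr"): True,
}


def is_grammatical(language: str, position: str, cluster: str) -> Optional[bool]:
    """Return True or False when the notes give a judgment, else None."""
    return JUDGMENTS.get((language, position, cluster))


if __name__ == "__main__":
    for key in [("English", "initial", "ʃtr"),
                ("German", "initial", "ʃtr"),
                ("Russian", "medial", "ftr")]:
        print(key, is_grammatical(*key))
```

The same cluster receives different judgments depending on the language and on its position in the word, which is the sense in which the pattern is a matter of grammar rather than of articulation.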
3. A given sound may have a different function or status in the sound patterns of different languages
For example, the glottal stop [ʔ] occurs in both English and Arabic BUT ...
In English, at the beginning of a word, [ʔ] is just a way of beginning vowels, and does not occur with consonants. In the middle or at the end of a word, [ʔ] is one possible pronunciation of /t/ in e.g. "pat" [paʔ].
In Arabic, /ʔ/ is a consonant sound like any other (/k/, /t/ or whatever): [ʔíktib] "write!", [daʔíiʔa] "minute (time)", [ħaʔʔ] "right".
4. Phonemes and allophones, or sounds and their variants
The vowels in the English words "cool", "whose" and "moon" are all similar but slightly different. They are three variants or allophones of the /u/ phoneme. The different variants are dependent on the different contexts in which they occur. Likewise, the consonant phoneme /k/ has different variant pronunciations in different contexts. Compare: 
 
Word     Phonemes   Variant   Description
keep     /kip/      [k̟ʰ]      The place of articulation is further forward in the mouth
cart     /kɑt/      [kʰ]      The place of articulation is not so far forward in the mouth
coot     /kut/      [kʰʷ]     The place of articulation is further back, and the lips are rounded
seek     /sik/      [k`]      There is less aspiration than in initial position
scoop    /skup/     [k]       There is no aspiration after /s/
These are all examples of variants according to position (contextual variants). There are also variants between speakers and dialects. For example, "toad" may be pronounced [təʊd] in high-register RP, [toʊd] or [tod] in the North. All of them are different pronunciations of the same sequence of phonemes. But these differences can lead to confusion: [toʊd] is "toad" in one dialect, but may be "told" in another.
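The contextual variants of /k/ listed above can be read as a set of ordered rules that pick a variant from the neighbouring sounds. The following is a minimal sketch under that reading. The function name `describe_k`, the rule ordering, and the vowel classes are assumptions made for illustration; they cover only the five example words and return descriptive labels rather than phonetic symbols.

```python
# Vowel classes needed for the five example words only (assumed labels).
FRONT_VOWELS = {"i"}
ROUNDED_BACK_VOWELS = {"u"}


def describe_k(phonemes: str) -> str:
    """Describe the variant of /k/ expected in a phonemic string.

    `phonemes` is a plain string of phoneme symbols containing one "k",
    e.g. "kip" for "keep". The rules are tried in order.
    """
    i = phonemes.index("k")
    # Rule 1: no aspiration after /s/ (as in "scoop").
    if i > 0 and phonemes[i - 1] == "s":
        return "unaspirated"
    # Rule 2: less aspiration in final position (as in "seek").
    if i == len(phonemes) - 1:
        return "weakly aspirated"
    # Rules 3-5: initial /k/ is aspirated; its place follows the next vowel.
    following = phonemes[i + 1]
    if following in FRONT_VOWELS:
        return "fronted, aspirated"
    if following in ROUNDED_BACK_VOWELS:
        return "backed, rounded, aspirated"
    return "aspirated"


if __name__ == "__main__":
    for word, phonemes in [("keep", "kip"), ("cart", "kɑt"), ("coot", "kut"),
                           ("seek", "sik"), ("scoop", "skup")]:
        print(word, "->", describe_k(phonemes))
```

All five outputs belong to the single phoneme /k/; only the context differs, which is what makes them allophones rather than separate phonemes.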
5. Phonological systems
Phonology is not just (or even mainly) concerned with categories or objects (such as consonants, vowels, phonemes, allophones, etc.) but is also crucially about relations. For example, the English stops and fricatives can be grouped into related pairs which differ in voicing and (for the stops) aspiration: 
 
Voiceless/aspirated    pʰ   tʰ   kʰ   f   s   θ   ʃ   h
Voiced/unaspirated     b    d    ɡ    v   z   ð   ʒ   (unpaired)
Patterns lead to expectations: we expect the voiceless fricative [h] to be paired with a voiced [ɦ], but we do not find this sound as a distinctive phoneme in English. And in fact /h/ functions differently from the other voiceless fricatives (it has a different distribution in words etc.). So even though [h] is phonetically classed as a voiceless fricative, it is phonologically quite different from /f/, /s/, /θ/ and /ʃ/.
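The pairing argument can be made concrete by treating the voicing relation as a mapping and asking which voiceless sound has no partner. This is a small sketch that assumes the standard English pairs (including θ/ð and ʃ/ʒ). The names `VOICED_PARTNER` and `unpaired` are hypothetical.

```python
# English voiceless stops and fricatives, in the order of the table.
VOICELESS = ["p", "t", "k", "f", "s", "θ", "ʃ", "h"]

# The voicing relation: each voiceless sound mapped to its voiced partner.
# /h/ is deliberately absent because English has no distinctive voiced partner.
VOICED_PARTNER = {
    "p": "b", "t": "d", "k": "ɡ",
    "f": "v", "s": "z", "θ": "ð", "ʃ": "ʒ",
}


def unpaired(voiceless, partner):
    """Return the voiceless sounds that have no voiced partner."""
    return [sound for sound in voiceless if sound not in partner]


if __name__ == "__main__":
    print(unpaired(VOICELESS, VOICED_PARTNER))
```

The gap only becomes visible once the sounds are organised by the relation between them, not when they are listed as isolated categories.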
Different patterns are found in other languages. In Classical Greek a three-way distinction was made between stops: 
 
Voiceless aspirated         pʰ   tʰ   kʰ
Voiceless unaspirated       p    t    k
Voiced (and unaspirated)    b    d    ɡ
In Hindi-Urdu a four-way pattern is found, at five places of articulation: 
 
Voiceless aspirated                    pʰ   tʰ   ʈʰ   cʰ   kʰ
Voiceless unaspirated                  p    t    ʈ    c    k
Voiced unaspirated                     b    d    etc.
Breathy voiced ("voiced aspirates")    bʱ   dʱ   etc.
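A four-way pattern at five places of articulation amounts to a cross-classification of two independent properties. The sketch below builds the system as a product of two lists. The place labels (labial, dental, retroflex, palatal, velar) are an assumption, since the table gives symbols and not place names, and the function name `stop_inventory` is hypothetical.

```python
from itertools import product

# The four laryngeal settings named in the table.
LARYNGEAL = [
    "voiceless aspirated",
    "voiceless unaspirated",
    "voiced unaspirated",
    "breathy voiced",
]

# Assumed labels for the five places of articulation.
PLACES = ["labial", "dental", "retroflex", "palatal", "velar"]


def stop_inventory():
    """Return every (laryngeal setting, place) combination."""
    return list(product(LARYNGEAL, PLACES))


if __name__ == "__main__":
    inventory = stop_inventory()
    print(len(inventory), "stop categories")
    for setting, place in inventory[:5]:
        print(setting, "+", place)
```

English and Classical Greek fit the same scheme with fewer laryngeal settings and fewer places, so the systems differ in the size of the grid and not in the kind of relation involved.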
6. Shapes of vowel systems: some common examples: 
 
Triangular (e.g. Arabic), 3 vowels:
    i       u
        a

Triangular (e.g. Japanese), 5 vowels:
    i       u
    e       o
        a

Triangular (e.g. Tübatulabal), 6 vowels:
    i   ɨ   u
    e       o
        a

Triangular (e.g. Italian), 7 vowels:
    i       u
    e       o
    ɛ       ɔ
        a

Triangular (e.g. Bulgarian), 6 vowels:
    i       u
    e   ə   o
        a

Rectangular (e.g. Montenegrin), 6 vowels:
    i       u
    e       o
    a       (plus a second low vowel)

How many degrees of vowel height are there in Bulgarian? On the face of things, it appears to be not very different from Tübatulabal, which has three heights: three high vowels, two mid vowels and one low vowel. But if we look more closely into Bulgarian phonology, we see that the fact that schwa is similar in height to /e/ and /o/ is coincidental: the distinction that matters in Bulgarian is /i/ vs. /e/, /u/ vs. /o/ and /ə/ vs. /a/, i.e. relatively high vs. relatively low. As evidence for this statement, note that while all six vowels may occur in stressed syllables, only /i/, /e/, /ə/ and /u/ occur in unstressed syllables.
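The evidence from unstressed syllables can be checked with simple set arithmetic. This sketch takes the schwa to be the sixth Bulgarian vowel and uses only the two vowel sets and the three pairs given in the paragraph above. The names `HEIGHT_PAIRS`, `excluded_when_unstressed` and `lower_members` are hypothetical.

```python
# Vowels that occur in stressed and in unstressed syllables, as stated above.
STRESSED = {"i", "e", "ə", "a", "o", "u"}
UNSTRESSED = {"i", "e", "ə", "u"}

# The three oppositions: (relatively high, relatively low).
HEIGHT_PAIRS = [("i", "e"), ("u", "o"), ("ə", "a")]


def excluded_when_unstressed():
    """Vowels found in stressed syllables but not in unstressed ones."""
    return STRESSED - UNSTRESSED


def lower_members():
    """The relatively low member of each pair."""
    return {low for _high, low in HEIGHT_PAIRS}


if __name__ == "__main__":
    excluded = excluded_when_unstressed()
    print("excluded when unstressed:", sorted(excluded))
    print("all excluded vowels are lower members:", excluded <= lower_members())
```

Every vowel missing from unstressed syllables is the lower member of one of the three pairs, which fits a two-way high vs. low analysis better than a three-height one.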
7. Phonology as interpretation of phonetic patterns: Fang (Bantu: Cameroon, Gabon, Equatorial Guinea) 
 

      Fang        English               Fang     English
 1)   etf-        shoulder         7)   tm       branch
 2)   vbi,v-bi    hippopotamus     8)   bikq     back teeth
 3)   ndv()       dam              9)   eln      water tortoise
 4)   kf-l        tortoise        10)   fq       bag
 5)   kf-         salt            11)   t        neck
 6)   kl          rope            12)   osn      squirrel
Vowels in corpus: 
 
    i   y       (?u expected but not found)
    e       o
        a

Further reading
Lass, R. (1984) Phonology: an introduction to basic concepts. Cambridge University Press.
Jakobson, R. (1962) The phonemic concept of distinctive features. In A. Sovijärvi and P. Aalto, eds. Proceedings of the Fourth International Congress of Phonetic Sciences. Mouton & Co. 440-455.
Jakobson, R. and M. Halle (1956) Fundamentals of Language. Mouton.
Kelly, J. (1974) Close vowels in Fang. Bulletin of the School of Oriental and African Studies 37, 119-123.