Dennis agista

Sunday, 26 February 2012

PHONETIC INTRODUCTION


INSIGHTS INTO SPOKEN LANGUAGE GLEANED FROM
PHONETIC TRANSCRIPTION OF THE SWITCHBOARD CORPUS

Steven Greenberg, Joy Hollenback, Dan Ellis

University of California, Berkeley
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704 USA

ABSTRACT

Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal and have fared poorly in automatic recognition of natural, informal speech (e.g., the Switchboard corpus).

Over the past five months the Switchboard Transcription Project has phonetically transcribed a portion of the Switchboard corpus in an effort to better understand the failure of phoneme-centric models for machine recognition of speech, as well as to provide a database through which to improve the performance of recognition systems focused on conversational dialogs.

Transcription of spoken dialogs illustrates the pitfalls of a phoneme-based system. Many words are articulated in such a fashion as to either omit or significantly transform the phonetic properties of phonemic constituents, thus resulting in wide variation of word pronunciations. Often, only the barest hint of a segment is realized phonetically, in spite of good intelligibility.

Despite this large variability in phonetic realization of words, the temporal properties of speech segments, both phones and syllables, appear to conform to regular patterns. This temporal regularity suggests that much of the linguistic information in speech may be signaled through variations in amplitude, pitch and the coarse spectrum, and that such patterns may be useful in the design of future-generation speech recognition systems.

1. INTRODUCTION

Models of speech perception and recognition focus on the phone(me) as the basic representational unit from which lexical units are ultimately derived. Although this representational model often provides an adequate (if not completely comprehensive) descriptive basis for the acoustics of carefully articulated and read speech, it fails to capture many of the spectro-temporal properties of spontaneous speech typical of informal spoken dialog.

The Switchboard corpus provides an excellent test-bed with which to compare the phonetic properties of spontaneous speech with those characteristic of more formal speaking situations as encapsulated by the TIMIT, ATIS and Wall Street Journal corpora. For the latter three corpora, performance by automatic speech recognition systems typically ranges between 85 and 98% correct. In contrast, material from the Switchboard corpus is typically recognized with only 40-60% accuracy.

In the Switchboard corpus two individuals discuss a specific topic, such as summer vacations, professional dress codes, the international political situation, credit cards and so on, for several minutes over the telephone. The dialog contains a significant proportion of "filled pauses" (e.g., "um," "uh-huh," etc.), "misarticulations" (e.g., transpositions of specific phonetic segments), and phonetic and lexical deletions ("University of Nebraska" being pronounced [yuw nix ver six n dix bclbrae skclkaeq], where the "of" is entirely deleted, and the final syllable of "University" [dix] is delayed till after the initiation of the nasal consonant in "Nebraska").

Often, only the vaguest hint of the "appropriate" spectral cues is present in the spectrographic representation. Typically, formant transitions usually associated with specific segments (such as liquids or nasals) are either entirely missing or differ appreciably from the patterns observed in more formally articulated speech. Such deviations from the "canonical" phonetic representation pose a significant challenge to current models of speech recognition.

2. SWITCHBOARD TRANSCRIPTION PROJECT

In order to more fully characterize the phonetics of spontaneous speech, seventy-two minutes of the Switchboard corpus (comprising portions of 618 conversations from 750 speakers, representing both genders, and spanning a wide range of adult ages and dialectal patterns from American English) were phonetically transcribed by a group of eight Linguistics students (7 undergraduates and 1 graduate student), all of whom had received previous training in phonetic transcription and general phonetics/phonology at the University of California, Berkeley. The transcribers were closely supervised by both the senior author and Professor John Ohala in order to ensure as accurate and as uniform a transcription of the materials as possible. Specific transcription issues were discussed at weekly project meetings, using a 60" BARCO projection screen for computer display and audio feedback.

The phonetic transcriptions were encoded with a variant of the Arpabet transcription system used for the TIMIT corpus. This

                 Proceedings of the International  Conference on Spoken Language Processing, Philadelphia, 1996
