Voice
recognition is the
process of taking the
spoken word as an input
to a computer program.
This process is
important to virtual
reality because it
provides a fairly
natural and intuitive
way of controlling the
simulation while
allowing the user's
hands to remain free.
This article will delve
into the uses of voice
recognition in the field
of virtual reality,
examine how voice
recognition is
accomplished, and list
the academic disciplines
that are central to the
understanding and
advancement of voice
recognition
technology.
What is voice recognition, and why is it useful in a virtual environment?
Voice
recognition is "the
technology by which
sounds, words or phrases
spoken by humans are
converted into
electrical signals, and
these signals are
transformed into coding
patterns to which
meaning has been
assigned" [ADA90].
While the concept could
more generally be called
"sound
recognition", we
focus here on the human
voice because we most
often and most naturally
use our voices to
communicate our ideas to
others in our immediate
surroundings. In the
context of a virtual
environment, the user
would presumably gain
the greatest feeling of
immersion, or being part
of the simulation, if
they could use their
most common form of
communication, the
voice. The difficulty in
using voice as an input
to a computer simulation
lies in the fundamental
differences between
human speech and the
more traditional forms
of computer input. While
computer programs are
commonly designed to
produce a precise and
well-defined response
upon receiving the
proper (and equally
precise) input, the
human voice and spoken
words are anything but
precise. Each human
voice is different, and
identical words can have
different meanings if
spoken with different
inflections or in
different contexts.
Several approaches have
been tried, with varying
degrees of success, to
overcome these
difficulties.
How is voice recognition performed?
The
most common approaches
to voice recognition can
be divided into two
classes: "template
matching" and
"feature
analysis". Template
matching is the simplest
technique and has the
highest accuracy when
used properly, but it
also suffers from the
most limitations. As
with any approach to
voice recognition, the
first step is for the
user to speak a word or
phrase into a
microphone. The
electrical signal from
the microphone is
digitized by an
"analog-to-digital
(A/D) converter",
and is stored in memory.
To determine the
"meaning" of
this voice input, the
computer attempts to
match the input with a
digitized voice sample,
or template, that has a
known meaning. This
technique is a close
analogy to the
traditional command
inputs from a keyboard.
The program contains the
input template, and
attempts to match this
template with the actual
input using a simple
conditional
statement.
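The matching step described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: the templates
are tiny hand-written lists rather than real digitized speech, and the
names TEMPLATES, distance, recognize, and threshold are hypothetical,
not taken from any particular system. Because two utterances are never
bit-for-bit identical, the sketch accepts the closest template only if
it lies within a distance threshold.

```python
from math import sqrt

# Hypothetical templates: short lists of digitized amplitude values, each
# paired with a known meaning. Real templates would hold thousands of samples.
TEMPLATES = {
    "open": [0.0, 0.8, 0.9, 0.3, 0.0],
    "close": [0.0, 0.2, 0.7, 0.9, 0.1],
}


def distance(a, b):
    """Euclidean distance between two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(samples, templates=TEMPLATES, threshold=0.5):
    """Return the meaning of the closest template, or None if none is close."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        d = distance(samples, template)
        if d < best_dist:
            best_word, best_dist = word, d
    # The conditional test: accept only a near-exact match.
    return best_word if best_dist <= threshold else None


spoken = [0.0, 0.7, 0.9, 0.4, 0.0]  # assumed digitized microphone input
print(recognize(spoken))  # prints: open
```

An input that resembles no stored template, such as five samples all
equal to 1.0, falls outside the threshold and yields None, much as an
unknown keyboard command would be rejected.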
Since
each person's voice is
different, the program
cannot possibly contain
a template for each
potential user, so the
program must first be
"trained" with
a new user's voice input
before that user's voice
can be recognized by the
program. During a
training session, the
program displays a
printed word or phrase,
and the user speaks that
word or phrase several
times into a microphone.
The program computes a
statistical average of
the multiple samples of
the same word and stores
the averaged sample as a
template in a program
data structure. With
this approach to voice
recognition, the program
has a
"vocabulary"
that is limited to the
words or phrases used in
the training session,
and its user base is
also limited to those
users who have trained
the program. This type
of system is known as
"speaker-dependent."
It can
have vocabularies on the
order of a few hundred
words and short phrases,
and recognition accuracy
can be about 98
percent.
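The training session can be illustrated with a minimal sketch. It
assumes, for simplicity, that every repetition of a word has already
been trimmed and aligned to the same number of samples; a real system
would have to handle recordings of different lengths. The names
train_template and vocabulary are hypothetical.

```python
def train_template(utterances):
    """Average several equal-length recordings of one word, sample by sample."""
    if not utterances:
        raise ValueError("at least one utterance is required")
    length = len(utterances[0])
    if any(len(u) != length for u in utterances):
        raise ValueError("utterances must have the same length")
    count = len(utterances)
    return [sum(u[i] for u in utterances) / count for i in range(length)]


# The program's vocabulary: one averaged template per trained word.
vocabulary = {}

# Three assumed repetitions of the word "open" by the same user.
vocabulary["open"] = train_template([
    [0.0, 0.8, 1.0, 0.2],
    [0.0, 0.6, 0.8, 0.4],
    [0.0, 0.7, 0.9, 0.3],
])
print(vocabulary["open"])
```

Only words stored in vocabulary can later be recognized, and only for
the user whose voice produced the averages, which is the limitation
that makes such a system speaker-dependent.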
A
more general form of
voice recognition is
available through
feature analysis, and
this technique usually
leads to
"speaker-independent"
voice recognition.
Instead of trying to
find an exact or
near-exact match between
the actual voice input
and a previously stored
voice template, this
method first processes
the voice input using
"Fourier
transforms" or
"linear predictive
coding (LPC)", then
attempts to find
characteristic
similarities between the
expected inputs and the
actual digitized voice
input. These
similarities will be
present for a wide range
of speakers, and so the
system need not be
trained by each new
user. The types of
speech differences that
the speaker-independent
method can deal with,
but which template
matching would fail to
handle, include accents
and variations in speed
of delivery, pitch,
volume, and inflection.
Speaker-independent
speech recognition has
proven to be very
difficult, with some of
the greatest hurdles
being the variety of
accents and inflections
used by speakers of
different nationalities.
Recognition accuracy for
speaker-independent
systems is somewhat less
than for
speaker-dependent
systems, usually between
90 and 95 percent.
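As a small illustration of why features can be more robust than raw
templates, the sketch below computes a Fourier magnitude spectrum and
scales it to unit length. It is a toy under stated assumptions: a
synthetic sine tone stands in for speech, a plain discrete Fourier
transform is used rather than LPC, and the names magnitude_spectrum
and features are hypothetical. It demonstrates only insensitivity to
volume; accents, speed, and inflection require far more elaborate
processing.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of the discrete Fourier transform for the lower half of the bins."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        total = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(total))
    return spectrum


def features(samples):
    """Spectrum scaled to unit length, so that overall loudness drops out."""
    spectrum = magnitude_spectrum(samples)
    norm = math.sqrt(sum(m * m for m in spectrum))
    return [m / norm for m in spectrum] if norm > 0 else spectrum


# Two synthetic "utterances": the same tone, one spoken four times louder.
quiet = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
loud = [4.0 * x for x in quiet]

# The raw samples differ greatly, but the normalized features agree.
difference = max(abs(a - b) for a, b in zip(features(quiet), features(loud)))
print(difference < 1e-9)  # prints: True
```

A raw template comparison would treat the loud and quiet recordings as
very different inputs, whereas the normalized features are the same up
to rounding error.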
Another
way to differentiate
between voice
recognition systems is
by determining if they
can handle only discrete
words, connected words,
or continuous speech.
Most voice recognition
systems are discrete-word
systems, and
these are easiest to
implement. For this type
of system, the speaker
must pause between
words. This is fine for
situations where the
user is required to give
only one-word responses
or commands, but is very
unnatural for
multiple-word inputs. In
a connected-word voice
recognition system, the
user is allowed to speak
in multiple-word
phrases, but they
must still be careful to
articulate each word and
not slur the end of one
word into the beginning
of the next word.
Totally natural, continuous
speech includes a
great deal of "coarticulation",
where adjacent words run
together without pauses
or any other apparent
division between words.
A speech recognition
system that handles
continuous speech is the
most difficult to
implement.
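The reliance of discrete word systems on pauses can be sketched as
follows. This is an assumption-laden toy, not a description of any
real product: the function split_on_pauses and its parameters
silence_level and min_pause are hypothetical, and the sample values
are invented. A run of quiet samples at least min_pause long is
treated as a gap between words.

```python
def split_on_pauses(samples, silence_level=0.05, min_pause=3):
    """Split samples into words wherever a long enough quiet run occurs."""
    words = []    # completed words, each a list of samples
    current = []  # samples of the word being collected
    quiet = []    # quiet samples seen since the last loud one
    for x in samples:
        if abs(x) < silence_level:
            quiet.append(x)
            if len(quiet) >= min_pause and current:
                words.append(current)  # pause long enough: the word ended
                current = []
        else:
            if current:
                current.extend(quiet)  # short dip inside a word: keep it
            quiet = []
            current.append(x)
    if current:
        words.append(current)
    return words


# Two words separated by a clear pause of four quiet samples.
paused = [0.0, 0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
print(len(split_on_pauses(paused)))  # prints: 2

# The same two words run together, as in continuous speech.
slurred = [0.5, 0.6, 0.0, 0.4, 0.0, 0.7, 0.8]
print(len(split_on_pauses(slurred)))  # prints: 1
```

When the pause disappears, the two words are returned as a single
unit, which suggests why continuous speech cannot be handled by
silence detection alone.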
What disciplines are involved in voice recognition?
The
template matching method
of voice recognition is
founded in the general
principles of digital
electronics and basic
computer programming. To
fully understand the
challenges of efficient
speaker-independent
voice recognition, the
fields of phonetics,
linguistics, and digital
signal processing should
also be explored.
References
[ADA90] Adams, Russ, Sourcebook of Automatic Identification and Data Collection, Van Nostrand Reinhold, New York, 1990.
[CAT84] Cater, John P., Electronically Hearing: Computer Speech Recognition, Howard W. Sams & Co., Indianapolis, IN, 1984.
[FOU89] Fourcin, A., G. Harland, W. Barry, and V. Hazan, editors, Speech Input and Output Assessment, Ellis Horwood Limited, Chichester, UK, 1989.
[YAN87] Yannakoudakis, E. J., and P. J. Hutton, Speech Synthesis and Recognition Systems, Ellis Horwood Limited, Chichester, UK, 1987.