Voice Recognition for Home Control (1/3/2006)
By
David Milward, Linguamatics
For several years, demonstration 'homes of
the future' have shown control of home devices by voice. There are
some obvious benefits. Voice allows hands-free control. This is
vital for people who find keyboards difficult or impossible to use,
but also useful to anyone who wants to control their music, telephone
etc., while doing other tasks, whether this be ironing, cooking
or having a shower.
'Spoken dialogue technology' takes voice
control one step further, allowing a system to interact in a step-by-step
fashion with a user, asking questions and responding to the replies.
This makes advanced functionality of multiple, complex, networked
devices much more accessible, not just to the technically-minded
with the patience to wade through manuals. It also allows people
to phone up their home and access devices remotely, for example
to turn on the heating or air conditioning before returning.
Voice control and spoken dialogue are starting
to become commonplace in cars, not just in top-of-the-range models.
So what is the potential in the home, what are the component technologies,
and what is needed for it to become equally commonplace?
Potential for Spoken Control in the Home
Although spoken control can be used for simple
individual devices in the same room as the user, the real benefits
are for control of remote devices, of complex or multiple devices,
and interaction with services. Let us consider each of these in
turn.
1. Remote devices
This includes controlling the home by phone from outside, or controlling
devices in other rooms. Monitoring of devices at a remote location,
for example the home of someone with a medical condition, is also
possible. Devices themselves may initiate interactions: for example,
a warning may be relayed over speakers if a device senses an unusual
state, giving a user a chance to override any automatic shutdown.
2. Complex devices
Although it is possible to control complex devices using on-screen
menus on LCD panels, menu structures can be difficult to navigate
for new users. Voice shortcuts allow people to get to an option
immediately, such as a volume setting, without having to know, for
example, that 'volume setting' is below 'user configuration'. It is
possible to access hundreds of devices or services by voice - many
more than can appear as shortcuts on a screen. It is also possible
to skip several steps by setting parameters directly, using a command
such as 'turn the TV volume to 5%'.
3. Multiple devices
Controlling multiple devices and checking their status is very natural
using voice. For example, 'Turn off all the lights' or 'Have I left
anything on?' Although lights and blinds may increasingly use sensors,
homeowners will still want the ability to override and to choose
particular mood settings. The ability to program multiple devices,
for example 'Switch the hall light on when the front door is opened',
is also critical if users are to exploit the benefits of networked
devices.
4. Services
It is possible today to access services over the phone by voice,
but these are not tailored to the home or able to interact with
other services from other suppliers. When booking a film and a taxi,
it would be great to book the film then just say 'We need a taxi
to get there' instead of having to re-enter the day, time and destination.
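The film-and-taxi example amounts to carrying details from one booking over to the next. The following is a minimal sketch of that idea in Python, not a description of any existing service: the class, the function names, the slot names and the sample film and venue are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DialogueContext:
    """Details remembered from earlier requests in the same conversation."""
    slots: Dict[str, str] = field(default_factory=dict)

    def remember(self, **values: str) -> None:
        self.slots.update(values)


def book_film(context: DialogueContext, title: str, venue: str,
              day: str, time: str) -> Dict[str, str]:
    """Hypothetical cinema booking that also records where and when."""
    context.remember(destination=venue, day=day, arrive_by=time)
    return {"service": "cinema", "title": title, "venue": venue,
            "day": day, "time": time}


def book_taxi(context: DialogueContext, pickup: str,
              **overrides: str) -> Dict[str, str]:
    """Hypothetical taxi booking that fills gaps from the shared context."""
    request = {"service": "taxi", "pickup": pickup}
    for slot in ("destination", "day", "arrive_by"):
        value = overrides.get(slot, context.slots.get(slot))
        if value is None:
            # A real dialogue system would ask the user a question here.
            raise ValueError(f"missing slot: {slot}")
        request[slot] = value
    return request


context = DialogueContext()
book_film(context, "Casablanca", "Arts Cinema", "Friday", "19:30")
# 'We need a taxi to get there': only the pickup point is new information.
taxi = book_taxi(context, pickup="home")
```

The point of the sketch is that the two services share one context object, so the taxi request does not need the day, time or destination to be spoken again.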
The technology
'Speech recognition' is the process of taking
a speech signal and converting it into words. The technology has
improved steadily, if not spectacularly, over the last few decades,
but it is still not possible to accurately convert anyone's voice
talking about any subject. The recogniser therefore needs to be
trained to one or more specific speakers, i.e. 'speaker-dependent,
large vocabulary' recognition, or restricted in the number of words
it can recognise, i.e. 'speaker-independent, small vocabulary'.
'Speech synthesis' is the opposite process:
taking words and creating a speech signal. The quality of synthesis
has improved rapidly in recent years, helped by larger computer
memory. In the best systems, the robot-like voices of the past have
been replaced by very natural-sounding speech, with a choice of
accents.
In addition to converting from words to speech
and vice versa, a system also has to understand what the words mean,
and be able to convert instructions such as 'Turn the living room
light on' or 'Switch on the light in the lounge' into the same command.
This is 'language understanding' or 'natural language processing'.
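As a rough illustration of mapping different wordings onto one command, here is a minimal keyword-matching sketch in Python. It is far simpler than a real language understanding component, and the vocabulary tables, the Command type and the understand function are assumptions made for this example only.

```python
import re
from typing import NamedTuple, Optional


class Command(NamedTuple):
    action: str
    device: str
    room: str


# Hypothetical vocabulary: several room names can map to one canonical room.
ROOM_SYNONYMS = {"living room": "lounge", "lounge": "lounge",
                 "kitchen": "kitchen", "hall": "hall"}
ACTION_WORDS = {"on": "switch_on", "off": "switch_off"}
DEVICES = ("light", "tv", "heating")


def understand(utterance: str) -> Optional[Command]:
    """Turn a recognised word string into a canonical command, if possible."""
    text = utterance.lower()
    words = re.findall(r"[a-z]+", text)
    action = next((ACTION_WORDS[w] for w in words if w in ACTION_WORDS), None)
    device = next((d for d in DEVICES if d in words), None)
    room = None
    # Try longer room names first so multi-word names are matched whole.
    for name in sorted(ROOM_SYNONYMS, key=len, reverse=True):
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            room = ROOM_SYNONYMS[name]
            break
    if action and device and room:
        return Command(action, device, room)
    return None


first = understand("Turn the living room light on")
second = understand("Switch on the light in the lounge")
# Both wordings produce the same Command value.
```

Word order does not matter in this sketch, which is why both phrasings reduce to the same result; a real system would also need to handle negation, several devices in one sentence, and recognition errors.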
Finally, if the system is to interact with
the user in a conversation, the system requires a 'dialogue manager'.
In most systems, this is a fixed script describing each step of
an interaction with a user, but more intelligent and flexible systems
are being developed. For example, in ontology-based dialogue, the
dialogue manager calculates what to say next based on its knowledge
of the devices and services (e.g. a light with the ability to switch
between on and off) and the scenario (e.g. a light which is in the
lounge, which is downstairs, which is in the house).
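The idea of calculating the next step from knowledge of devices and locations can be sketched as follows. This is a toy illustration in Python under stated assumptions, not the dialogue manager described in this article: the PART_OF table, the Device type, the sample devices and the next_move function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical scenario knowledge: each place maps to the place containing it.
PART_OF: Dict[str, str] = {
    "lounge": "downstairs", "kitchen": "downstairs",
    "bedroom": "upstairs",
    "downstairs": "house", "upstairs": "house",
}


@dataclass(frozen=True)
class Device:
    name: str
    kind: str
    location: str
    states: Tuple[str, ...]  # the settings this device can switch between


DEVICES = [
    Device("lounge light", "light", "lounge", ("on", "off")),
    Device("kitchen light", "light", "kitchen", ("on", "off")),
    Device("bedroom light", "light", "bedroom", ("on", "off")),
]


def is_within(place: Optional[str], area: str) -> bool:
    """Walk up the location hierarchy to see whether place lies inside area."""
    while place is not None:
        if place == area:
            return True
        place = PART_OF.get(place)
    return False


def next_move(kind: str, state: str, area: str = "house") -> str:
    """Decide what to say next from the device and location knowledge."""
    candidates = [d for d in DEVICES
                  if d.kind == kind and state in d.states
                  and is_within(d.location, area)]
    if not candidates:
        return f"Sorry, there is no {kind} in the {area} that can be turned {state}."
    if len(candidates) == 1:
        return f"OK, turning the {candidates[0].name} {state}."
    options = ", ".join(d.location for d in candidates)
    return f"Which {kind} do you mean: {options}?"
```

With the sample data, asking for a light 'upstairs' finds exactly one candidate and is acted on directly, while asking for a light 'downstairs' finds two and leads to a clarifying question. No step of the conversation is scripted in advance; adding a device to the list changes the behaviour without any script being rewritten.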
Components required for a home system
The following components are required for
a home system, in addition to the networked home:
1. Microphone(s) and speaker(s). Portable
devices range from Bluetooth headsets or mobile phones, to tablet
PCs with embedded microphones and speakers, as in the picture below.
Array microphones can be used to pick up sound from any position
in a room. Small, fixed microphones and speakers in chosen locations,
such as above a kitchen worktop, provide a cheaper alternative.

Tablet PC with touchscreen, microphone and speakers
2. Speech recogniser/synthesiser. This is
usually software running on a medium- to high-specification PC.
There are systems designed for embedded devices, but quality tends
to be related to memory/disk usage. A telephony card or Voice Over
IP is required for remote use.
3. Language understanding and dialogue management.
This is again software running on a PC, but it needs relatively little
memory and processing power compared with speech recognition
or synthesis.
4. Connection to devices. Messages need to
be received from, and sent to devices by the dialogue manager. This
may be done using X10 for example, or via an API to a home control
software package.
As well as installing the components, some
configuration, and later reconfiguration, by the homeowner is required,
so that the system knows about the rooms in the house, and the location
of the devices.
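To make the configuration and device-connection steps concrete, here is a minimal sketch in Python. It assumes X10-style addresses (a house code letter plus a unit number) and a hypothetical send function standing in for whatever X10 interface or home control API is actually installed; the table contents and all function names are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical homeowner configuration: (room, device) -> X10-style address.
HOME_CONFIG: Dict[Tuple[str, str], str] = {
    ("lounge", "light"): "A1",
    ("hall", "light"): "A2",
    ("kitchen", "light"): "A3",
}

# A sender takes an address and a command; the real transport is not shown.
Sender = Callable[[str, str], None]


def make_recording_sender(log: List[Tuple[str, str]]) -> Sender:
    """Stand-in transport that records messages instead of sending them."""
    def send(address: str, command: str) -> None:
        log.append((address, command))
    return send


def switch(room: str, device: str, command: str, send: Sender) -> bool:
    """Send a command to one configured device; False if it is unknown."""
    address = HOME_CONFIG.get((room, device))
    if address is None:
        return False
    send(address, command)
    return True


def switch_all(device: str, command: str, send: Sender) -> int:
    """Handle requests such as 'Turn off all the lights'."""
    count = 0
    for (_, configured_device), address in HOME_CONFIG.items():
        if configured_device == device:
            send(address, command)
            count += 1
    return count


log: List[Tuple[str, str]] = []
send = make_recording_sender(log)
switch("lounge", "light", "ON", send)
switch_all("light", "OFF", send)
```

Keeping the room and device table separate from the code means that reconfiguration, such as moving a lamp to another room, only changes the table.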
Current status and future developments
Spoken interaction has the potential to provide
a natural and very flexible way to talk to devices. Systems based
on single commands or scripted interactions, with training to a
speaker's voice, are already available as off-the-shelf software
packages. Spoken dialogue systems that provide more flexible interactions
are at the prototype and demonstration home stage. For example,
at the Advantica Test House in Loughborough, a demonstrator built
as part of a DTI-sponsored trial run by TAHI (The Application Home
Initiative) provides natural interactions using a multi-modal interface
as shown in the diagram below.

Multimodal interaction for controlling home devices
Users can read what is on the screen, listen
to instructions, or give voice instructions depending on what is
most convenient. As well as controlling devices, the demonstrator
includes a recipe application that reads out recipe instructions
step by step.
So why is spoken control in the home less
common than in-car control? Firstly, modern cars have networked
devices as standard. Secondly, there is a fixed number of known
devices in the car, so scripting interactions is possible. Thirdly,
the driver is in a fixed position, allowing microphones and speakers
to be carefully positioned to get maximum accuracy.
None of this need prevent spoken systems
from being used in the home now, or in similar environments such
as care homes. However, in the next few years, the likely uses are
going to be where there is a real need, such as for the elderly
or disabled, or in expensive homes which are already highly automated.
In the meantime, speech recognition technology is expected to continue
to improve, and research projects such as the EU-sponsored Talk
Project are developing the next generation of fully reconfigurable
systems that will be more suitable for dynamic home environments,
where devices and services are continually changing.
David Milward is the CTO of Linguamatics Ltd,
provider of natural language processing technology, including next-generation
ontology-based dialogue management.
www.linguamatics.com