
Kepler University
Speech Input Technology for People with Disabilities


DE-4104 VOICE PROJECT ACCOMPANYING MEASURE
DISABLED USERS' NEEDS FOR SPEECH INPUT AT HCI
by Klaus Miesenberger, University of Linz, Austria

This Report in WORD format

Introduction
State of the Art of Speech Input Technology – Increasing Usability
The VOICE Project and the "VOICE meeting" system
Users' Needs Analysis of Speech Input Systems
Users' Needs Analysis of the "VOICE meeting" System
Conclusions and Recommendations

Introduction

Thanks to the increasing power and decreasing cost of PCs, speech input technology has made significant progress in recent years. Speech input systems are used for text and command input in Human-Computer Interaction (HCI). Beyond this, their potential to support interpersonal communication is being explored. Especially for users with disabilities who have difficulties with standard interaction methods or with interpersonal communication, speech input can provide new and efficient tools.

The EU-funded project "VOICE", co-ordinated by JRC-ISIS (the Joint Research Centre of the European Commission, Institute for Systems, Informatics and Safety), provided the framework for research on this topic. Besides JRC-ISIS, the University of Linz, Austria, and FBL S.r.l. Information Technologies, Mortara, Italy, took part in the project. The main objective was to raise awareness of the capabilities of speech input for people with disabilities, especially for hard of hearing and deaf people. Materials and deliverables are available on the Internet and in printed format at: http://voice.jrc.it

The following pages outline the results of a survey carried out to determine the level of usage and the usability of speech input systems for people with disabilities. Although the potential of this technology seems obvious, its application in practice appears to be rather limited. This calls for a users' needs analysis to identify usability problems. The main focus is on access to spoken language for hard of hearing and deaf people. A prototype of a subtitling system using off-the-shelf speech recognition engines was developed and tested.

Acknowledgements

This work was performed in the framework of the VOICE project (DE4104 - Disabled and Elderly), funded by the European Commission, DG Information Society, by means of its Telematics Applications Programme (TAP). This report is a summary of Deliverables 4.1 and 4.2 of the VOICE project.

State of the Art of Speech Input Technology -
Increasing Usability of Speech Recognition Systems

Speech is the most important, most complex and most human tool for communication and interaction. We build our understanding of the world and of each other on verbal, language-based interaction and communication. The possibility of using speech for HCI, for entering text and commands, has therefore long been part of wishful science fiction and is increasingly becoming an exciting challenge for engineering and interdisciplinary research.

Verbal communication goes far beyond a simple transfer of information. When interacting with a machine, one quickly learns that more than talking into a microphone is required. Even the smallest differences from human communication will be noticed and will influence usability. The whole environmental setting changes when speech input is employed. This leads to a mismatch between users' estimations and expectations on the one hand and the practical application of speech input systems on the other. Although the quality of these systems is already very high and appreciated, users, after short trials, go back to standard modes of interaction. Speech input meets with users' fascination and initial enthusiasm, but not with their needs in everyday practice.

Voice recognition is a step towards more user-friendly access to and usage of computers. It is most important for those users who are not able to use standard interaction methods, either because of functional limitations in special situations (e.g. a car driver who has to keep his hands on the wheel while carrying out other tasks such as handling the radio, the GPS device or the mobile telephone) or because of functional limitations caused by an impairment. Often no more than an alternative for ordinary users, speech input could become an important tool for "extra-ordinary" users.

Although the basic concepts of speech input technology were already defined in the 1970s and 1980s, useful applications only became available in the 1990s. Several improvements in Man-Machine Interaction (MMI) and in the usability of speech recognition systems have been achieved.

                      | 10 years ago    | 5 years ago          | today
----------------------+-----------------+----------------------+--------------------------------
recognition (words/min)| > 20           | > 60                 | > 160
style                 | disjointed      | continuous           | continuous
hardware              | special         | special              | standard
vocabulary            | > 5000, finite  | > 10000, extensible  | > 300000 (active),
                      |                 |                      | > 750000 (passive), extensible
training              | > 1 hour        | > 30 min             | < 10 min
price                 | ~ 3000          | < 1000               | < 150

Figure 1: Parameters of speech input systems

The step from recognition of disjointed (discrete) speech to continuous speech was the starting point for a wide range of applications. Today, text input via fluent speech into standard applications, as well as commanding and controlling the desktop, is possible.

  • Recognition rates above 99.5%, a speed of more than 160 words per minute and a vocabulary of more than 160,000 words are standard.
  • Voice recognition packages include very general vocabularies. Facilities to extend the vocabulary in an easy way are therefore included.
  • In addition, systems can analyse documents representing the personal language style of a user in order to learn and add words automatically.
  • Up-to-date systems are able to distinguish automatically between commands and dictated text. The user can also switch manually between "command mode" and "dictation mode".
  • Voice recognition systems are vocabulary- and speaker-independent only to a very limited extent. The user has to train the system by reading a story the system "knows". This input is used to build a personal voice model. The time necessary for training has been brought down from more than one hour to less than 10 minutes. This increases usability, especially for those who have difficulties with the dialogues and stories used in the training process.
  • Each speech input is used to refine recognition quality by recalculating the speech model.
  • Speech recognition can be used to operate, and to enter text into, almost any standard computer application. Macros allow applications to be adapted for handling via speech.
  • High-quality headset microphones (to keep the hands free) are part of the product packages.
  • Recognition of pre-recorded speech from standard or special dictating devices is possible.
  • Speech output functionalities help to control and proof-read the recognised text.
  • If mistakes occur, convenient correction functions and dialogues (wizards) are available.

Today the following areas remain most critical for achieving good recognition results:

  • quality of the microphone used
  • quality of the sound card
  • the power of the PC used
  • the background noise in the environment
  • the changing condition of a speaker

In the future, speech input will be used, besides dictating and commanding, for language/speaker recognition, speech understanding, and non-speech sound and object recognition, offering additional interesting perspectives for HCI.


The "VOICE meeting" system

The centre of gravity in the "VOICE" project was an easier and more complete access to spoken language for hard of hearing and deaf people. A special system, first called "VOICE prototype" and now "VOICE meeting", was developed. This system provides an interface to speech using subtitles. The "VOICE meeting" system is a multimedia workstation which, depending on the situation of use, employs a range of technology such as a powerful multimedia PC, video camera, video recorder/TV and wall projector.

Off-the-shelf speech input packages are basic assistive devices for hard of hearing and deaf people. But these systems are designed to input text into applications. This leads to fundamental usability problems when speech input is to be used for communication purposes. Recognised text is normally displayed in windows designed for a single user carrying out other tasks, not for communication. To keep the attention on the communication, any disturbance should be avoided. Interfaces to other applications should be hidden; interfaces to the recognition engine, the communication tool and other applications should be displayed only when necessary. Another basic requirement of such a tool is that the recognised text and other media, such as descriptive images (e.g. pictures, slides, graphs) and videos, especially the face of the speaker, should get as much space on the screen as possible.

When a sense is missing or restricted, switching between the available methods of access (e.g. from the speaker's lips to recognised text and vice versa) becomes most important. Deaf and hard of hearing listeners therefore also want to lip-read during subtitled communication. Problems occur when speakers move around the room, which changes the distance between the speaker's face and the subtitles. The system can therefore integrate live video of the speaker. Lip reading leaves some uncertainty in interpreting the words and is, depending on the awareness of the speaker, often interrupted. Errors in subtitles may lead to misunderstandings. A combination of lip reading and subtitles could reduce the danger that users lose track during a presentation.

Figure: "VOICE meeting" supported presentation

Starting from these basic requirements, a subtitling programme was developed. "VOICE meeting" provides an interface to speech using subtitles and is tailored for purposes such as:

  • subtitling conversations, speeches and presentations (wall projection)
  • subtitling in class (screen or wall presentation)
  • subtitling of TV and video
  • subtitling of telephone calls.

Figure: "VOICE meeting" supported video telephony

"VOICE meeting" reserves the whole screen for communication. A simple one-line toolbar provides the interface to all necessary interactions with the speech recognition engine (e.g. microphone on/off, speaker profile, language) as well as with the "VOICE meeting" software. This menu pops up on the screen when the mouse cursor passes the area of the interface on the desktop. Therefore "VOICE meeting reserves the whole interface to subtitles and images.

The basic functionalities described above turn voice recognition engines into a subtitling system. The interface can be configured to suit the needs of a given situation. The number of lines and their length can be defined. The text is displayed in a textbox on a definable background colour. Font, colour and size of the text can be set. Words are never split across two lines (no hyphenation), to improve readability.
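The layout rule described above can be sketched in a few lines of code. This is an illustrative reconstruction, not the actual "VOICE meeting" implementation: the function name and parameters are assumptions, but the behaviour matches the description, recognised text flowed into a fixed number of lines, breaking only at word boundaries.

```python
def wrap_subtitle(text: str, max_chars: int, max_lines: int) -> list[str]:
    """Wrap recognised text into at most max_lines display lines of at
    most max_chars characters, never splitting a word (no hyphenation)."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Line is full: start a new one with the word that did not fit.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    # Keep only the most recent lines that fit on screen,
    # scrolling older text off the top.
    return lines[-max_lines:]
```

For example, with a two-line display of 20 characters per line, older lines scroll away as new words arrive, and no word is ever hyphenated across a line break.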

An important feature is the setting of the duration for which the text has to remain on the screen to be read. This demands a certain style of speaking when the audience includes listeners using, or dependent on, subtitles, especially hard of hearing and deaf people. A short pause (greater than 250 msec) should be made, which tells the system to complete the recognition process (another 250 msec). Short sentences should be used, which also allows participants to lip-read. Each speaker needs to allow such a time span in order to co-ordinate speaking, recognition, display, perception and understanding. Such a style of speaking also allows the speaker to proof-read the recognised text and, if necessary, to repeat words that have not been recognised correctly. If necessary, the keyboard may be used to type specific words.
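The timing rule above can be made concrete with a small sketch: a silence longer than ~250 msec closes an utterance, and the engine is allowed roughly another 250 msec to finalise recognition before the subtitle appears. The event model, function name and exact thresholds here are illustrative assumptions, not the actual system's internals.

```python
PAUSE_THRESHOLD = 0.250    # seconds of silence that close an utterance
RECOGNITION_DELAY = 0.250  # extra time the engine needs to finalise

def segment_utterances(word_events):
    """word_events: ordered list of (start_time, end_time, word) tuples.
    Returns (utterance_text, display_time) pairs, where display_time is
    when the finished subtitle can appear on screen."""
    utterances, current, last_end = [], [], None
    for start, end, word in word_events:
        if last_end is not None and start - last_end > PAUSE_THRESHOLD:
            # A long enough pause: close the utterance and schedule it
            # for display once recognition has had time to complete.
            utterances.append((" ".join(current), last_end + RECOGNITION_DELAY))
            current = []
        current.append(word)
        last_end = end
    if current:
        utterances.append((" ".join(current), last_end + RECOGNITION_DELAY))
    return utterances
```

A short gap between words keeps them in the same subtitle; only a pause longer than the threshold triggers display, which is why the speaking style matters so much.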

Figure: "VOICE meeting" screen options dialogue

An additional important feature is the possibility to save the recognised text (for reference, reports, minutes). Especially for the target group of hard of hearing and deaf people, this offers at least "a posteriori" access to otherwise fragmentary spoken language. Additional functionalities enable the user to integrate slides, images, videos and other media simply by using predefined voice commands. Parameters such as the duration of appearance, or keywords to make an item disappear, are included.
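One way such predefined voice commands could be routed is sketched below. The command phrases, parameter names and actions are hypothetical illustrations of the mechanism described above, not the actual "VOICE meeting" command set: recognised text that matches a known phrase triggers a media action (with an optional display duration or dismissal keyword), while everything else is treated as subtitle text.

```python
# Hypothetical command table: phrase -> media action plus optional
# parameters (duration of appearance, keyword that dismisses the item).
MEDIA_COMMANDS = {
    "show slide": {"action": "display_slide", "duration": 30.0},
    "show video": {"action": "play_video", "dismiss_on": "stop video"},
    "next image": {"action": "display_image", "duration": 15.0},
}

def handle_recognised_text(text):
    """Route recognised text: known command phrases trigger media
    actions, everything else is displayed as a subtitle."""
    command = MEDIA_COMMANDS.get(text.lower().strip())
    if command is not None:
        return ("command", command["action"])
    return ("subtitle", text)
```

The point of the design is that the speaker never touches the keyboard during a presentation: the same speech channel carries both the subtitle text and the control of the accompanying media.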

Users' Needs Analysis of Speech Input Systems

Almost every article on speech technology outlines the high potential of speech recognition as a substitute for, or addition to, standard interaction methods for entering text or commands at the MMI. These systems would make flexible and efficient HCI possible. A significant number of articles mention the high potential of speech input technology for people with disabilities. Those having problems with standard interaction methods would benefit most from this technology. This seems obvious and of high practical value at first glance. Nevertheless, intensive and broad usage cannot be found in practice. Although available at a low price and with high quality, speech recognition has only become widely used in some specific areas (e.g. medicine, telephone operators).

The VOICE project was confronted with, and gave evidence of, this gap between potential and use. A users' needs analysis was performed to find out the reasons for this and possibilities to overcome the situation. The state of the art in usability research was evaluated, with a focus on usability, user-centred design and design-for-all for people with disabilities. The available body of knowledge for the most part refers to the process of development and/or evaluation of HCI. It does not address the specific situation of application and the impact on the organisational and environmental settings.

Following this study, several tests of speech input systems and of the "VOICE meeting" system were performed with users with disabilities and with experts in the field. Questionnaires and interview/discussion guides were used, and notes were taken in all tests, training sessions and presentations. The main results may be summarised in the following points:

  • Unexpected recognition quality and prices
    Quality of speech input systems is much higher and prices are lower than expected.

  • Integration into HCI
    Systems are already very well integrated into the standard MMI of PC workstations and standard applications. People are often not aware of this.

  • Unchanging practice
    Nevertheless these positive impressions do not lead to usage in practice. The unfamiliar and demanding style of interaction via speech and the differences compared to "human-human" communication make users prefer other methods of input.

  • Practice – more than simply speaking
    Users expect to sit down and simply speak to the machine. In reality, using speech input makes high demands on the user. The smallest differences compared with human communication will be noticed. The fact that speech input does not work exactly as expected, and that a certain behaviour and demanding, sometimes tiring, concentration are necessary, often leads, after initial enthusiasm, to disappointment.

  • Assistive device – lack of awareness
    In the field of assistive technology the survey revealed a more problematic situation. It is recognised that speech input has a high potential as an assistive device. Nevertheless there seem to be only a few applications. Users, after an enthusiastic training, do not go through the demanding and tiring process of building up expertise.

  • Users' expectations but not users' needs – need for practice
    The evaluation made evident that users mostly base their decision on speech input on estimations from short-term training. First estimations and users' actual needs differ considerably. The potential, and the need for practice, can only be discovered after a longer period of use.

  • Rethinking practice
    Applying speech input calls for rearrangements of the environmental settings. Without these rearrangements the potential of speech input cannot be exploited.

  • Need for more information and guidance
    Users would be in favour of information, guides to best practice, case studies and other resources on how to arrange the organisational and environmental settings. This could help to avoid wrong estimations and frustration after the training and a short trial.

  • Training process
    The biggest problems encountered when people with disabilities, elderly people and very young people start to use speech input are related to the training process. The training dialogues should be made more adaptable in order to take the specific needs of the users into account. The stories which have to be read are often complicated, demanding or tiring. More possibilities to choose and adapt these stories would be very much appreciated. The arrangement of the training dialogue often causes problems for visually handicapped and blind people. Stories and dialogues should be adaptable according to age, disability and skills.

  • More than text input
    Although different user groups with disabilities might have big problems using the system to dictate text, these systems are very suitable as an alternative or additional device for commanding, making use of a restricted set of keywords. People – including experts – are often not aware of these possibilities.

  • Handling the HCI – "relearning" the desktop
    Users find that commanding is a different approach to HCI than point-and-click methods. Knowledge of keywords is needed. Users do not expect the need to "relearn" the interface. Generally, basic skills in handling an HCI are a prerequisite.

  • Disturbance of concentration
    Psychology shows that interrupting concentration leads to a high rate of forgetting in short-term memory. When speaking on a certain subject is interrupted by a system message or a recognition error, this may lead to problems in continuing to dictate, command or work. People should be aware of this basic usability problem. Users should not concentrate on errors, especially when dictating complicated subjects.

  • Keywords – need for on-screen support
    Depending on the computer skills available, there is a need to obtain, and to adjust the intensity of, system support and help. "What can I say" functionalities are becoming available. Context-sensitive lists of commands as on-screen help are very much appreciated, as are outputs of commands in accessible formats (Braille, large print or digital).

  • Avoiding cognitive overload
    Because an additional input channel is used, the complexity of the system increases. Simplification and reduction should help to avoid cognitive overload. For example, users are confused when speech output is used in combination with speech input.

  • Only input support is taken into account
    Generally people expect improvements for mobility- and movement-impaired users. People, including experts, are not aware of the potential for other groups of people with disabilities, e.g. hard of hearing and deaf users, blind users (e.g. keeping the hands free for reading), visually handicapped users (e.g. navigation support), users with cognitive disabilities and elderly users (e.g. commanding to control the environment).

  • Unexpected recognition of unusual speech
    People are not aware how flexibly speech input systems adapt to unfamiliar styles of speaking. Tests were performed with people with severe speech impairments and with hard of hearing and deaf users speaking with an unusual style. Their speech was recognised even though a human listener could hardly have understood it.

  • Experts' expectations and not users' needs
    Decisions on using speech input for people with disabilities are often based on the opinions of experts and their experiences or estimations of speech input. As a result, the real needs of users, and the real potential, are not discovered.

Users' Needs Analysis of the "VOICE Meeting" System

The usability findings for speech input systems also apply to "VOICE meeting", which uses standard speech recognition engines. The needs concerning the presentation situation are discussed here.

  • Demand for such a system
    A system to support communication with hard of hearing and deaf people meets a demand of the target group.

  • Notes – very much appreciated
    Access to a "protocol" of a speech is a big help for hard of hearing and deaf listeners.

  • Addition and not replacement of traditional communication methods
    Users express their concern that such systems must not be seen as a replacement for sign language or lip reading. There is concern that such a system might be used as an excuse to reduce efforts in traditional deaf and hard of hearing communication methods.

  • Time lag between speaking and reading
    Although the speed and accuracy of speech recognition are very high, even the remaining small time lag between speaking and the presentation of the subtitles leads to problems. Listeners have to decide whether to follow lip-reading or the subtitles, or the speaker has to take care to make longer pauses to allow the use of both.

  • Lot of technology, demanding preparations
    The effort to prepare all the technology needed, to train the system and to structure the speech according to the system is very high. Speakers are often not willing to do this.

  • Special style of presenting
    The need to behave as the system prescribes also makes speakers doubtful.

  • Disturbing concentration
    Even a few mistakes disturb the presentation and the concentration of listeners and speakers. People are especially concerned that the system shifts the focus away from the contents and onto the system.

  • External factors
    People are concerned that the system still depends too much on many external factors.

  • Complex language
    Spoken sentences are often too complex to be read by deaf users, due to their special language usage. Speakers should be prepared to adopt a style of speaking which suits the needs of deaf listeners. The fact that language skills differ widely within the target group also makes application in practice difficult.

  • Help for subtitling
    The VOICE project was able to show that the system could be a valuable tool to support subtitling of TV broadcasts, presentations and videos.

  • Telephone subtitling
    Using speech input on the telephone line still seems to be very critical due to the low and changing quality of the speech signal. However, subtitling of conversations via video-telephones gives good results.

Conclusions and Recommendations

Being closely related to language, the most complex and most human tool on which we base our understanding of each other and of reality, speech recognition has to be applied very carefully.

Quite often users and experts, although enthusiastic at the beginning, do not take the necessary effort into account. The training and practice needed to reach a level where one can really benefit from this technology are underestimated. There are wrong assumptions about an easy-to-use technology, and a lack of awareness of the need for practice. Information, case studies, practical examples and methodological guides should help to avoid wrong estimations and frustration, in order to put the high potential into practice.

People doubt whether a system like "VOICE meeting" would really be applicable in practice. This estimation is made not because of the technology, but because of the preparation needed and because the standard style of interaction and communication is interrupted by this technology. There are of course special areas of useful application (e.g. special schools, TV subtitling, taking notes).

This shows that, although a considerable body of knowledge is available concerning the accessibility and usability of HCI in general and for people with disabilities in particular, there is a need for further research and engagement to put the potential of speech input systems, and of systems like "VOICE meeting", into action.

In any case, systems like "VOICE meeting" have to be seen as first steps towards a more user-friendly and more adaptable use of spoken language at the MMI. All problems encountered should be seen as invitations for further research and development activities.

February 2001
Klaus Miesenberger

