
THE VOICE PROJECT

Hardware Configuration and Software Development

ICCHP Congress, Vienna 1998

Angelo Paglino
FBL

Giuliano Pirelli
Istituto per i Sistemi, l'Informatica e la Sicurezza

fbl@lomellina.it
voice@jrc.it
http://voice.jrc.it


Abstract: Automatic recognition of speech in conversation, conferences, television broadcasts and telephone calls, with its translation into PC screen messages, could be a powerful help for the deaf. The paper presents the technical aspects of the VOICE Project, an Accompanying Measure of the European Commission's Telematics Programme. The Project is chaired by the Institute for Systems, Informatics and Safety of the Joint Research Centre, in collaboration with the Kepler University of Linz, the Software Solutions and FBL software houses near Milan, the ALFA and CECOEV Associations of the deaf of Milan and the Institute for the deaf of Linz. The hardware interfaces and the software developed by the FBL software house, in collaboration with the JRC-ISIS laboratory, are presented.

1. The VOICE Project technical aspects

Automatic recognition of speech in conversation, conferences and telephone calls, in order to translate the voice into PC screen messages, could be a very powerful help for the deaf. One of the objectives of the VOICE Project is the development of a demonstrator, necessary for generating awareness and stimulating discussion regarding the possible applications of voice to text recognition. The Project aims not only to promote new technologies in the field of voice to text recognition, but also to stimulate and increase the use of new, widely diffused technologies (such as the Internet), with particular emphasis on the problems that may be encountered by the deaf.

The VOICE Project is chaired by the Institute for Systems, Informatics and Safety of the Joint Research Centre, in collaboration with the Institute for Computer Science of the Johannes Kepler University of Linz, the SoftSol and FBL software houses near Milan, the Associazione Lombarda Famiglie Audiolesi (ALFA) and the Centro Comunicare è Vivere (CECOEV), both of Milan, and the Institut für Hör- und Sehbildung (IHSB) of Linz.

For a general overview of the Project, please refer to a previous paper (The VOICE Project - Part 1) presented at the ICCHP-98 Conference by Giuliano Pirelli, the co-ordinator of the Project. For the user needs analysis and the validation of the demonstrator, please refer to another paper (The VOICE Project - Part 3 - The communication needs of the deaf), presented by Alessandro Mezzanotte, President of CECOEV, at the VOICE Workshop and the VOICE Special Interest User Group Meeting, which will be held during ICCHP-98. The present paper covers the technical aspects (hardware configuration and software developments) of the VOICE Project.

1.1. SoftSol and FBL software houses

Software Solutions (SoftSol) is a team of professionals with great experience in networking technologies. The Company has also co-ordinated, with MasterSoft and FBL, the development of an application for the blind that enables the interrogation of various telephone directories by voice commands. The program makes PBX management possible for a blind person, using the Dragon Dictate engine (integrated by FBL) for the voice input, while MasterSoft developed the speech synthesis for the output. SoftSol is in charge of the financial co-ordination of the VOICE Project.

FBL is a qualified distributor of the IBM VoiceType software package and has long experience as a system integrator of Dragon Dictate. With the collaboration of Aries it has gained significant experience in the domain of voice controlled applications for personal computers. In 1992 Aries introduced the first version of Dragon Systems' dictation software in Italy; FBL then installed the program on machines for motor-impaired users. The company has since been responsible for hundreds of installations of this type, using the more recent version of Dragon Dictate for Windows. FBL has collaborated with ALFA and ASPHI in giving presentations on the potential of voice to text recognition at several conferences and meetings, and collaborates with the JRC-ISIS VOICE Laboratory on the technical developments of the VOICE Project.

1.2. Design for all

The VOICE Project, in line with the JRC-ISIS background, TIDE policy and FBL methodology, is developing prototype applications of information technology for people with special needs, using as far as possible hardware and software commonly available on the market. This reduces development and maintenance costs, improves the quality of products on the general market for any user, and avoids the new barriers that new information technology tools often create. In the VOICE Project too, the design-for-all products that JRC-ISIS has developed, or whose development it has entrusted to FBL, should have two main characteristics: ease of use and low price. Experience shows that this goal is achievable with speech recognition technology as well, even if at the very beginning it could have seemed too ambitious, or just an impossible dream.

The current voice to text recognition packages, produced by companies such as IBM and Dragon Systems, are of very high quality. Their achievements are the result of a huge investment of time, manpower and money. Our goals are to integrate their products into systems for the deaf and to create awareness amongst the manufacturers, the hearing impaired community and service providers. This is being done through the creation of demonstrators and prototype systems, introducing the hearing impaired to the available technologies, showing them the possible applications (so that technical specifications can be laid out) and allowing them to approach companies and government entities themselves.

2. Market Background

The first Italian experience with voice to text recognition systems running on a standard PC dates from 1992: the Dragon Dictate package (Dragon Systems Inc., Newton, MA, USA) ran under DOS on a PC with an i80486 processor and an audio card developed by Dragon. The speed was about 20 to 30 words per minute and the price was still too high. At the same time IBM developed a speech recognition system on a RISC platform. In 1994 two new packages were announced for the Windows operating system: Dragon Dictate for Windows and IBM VoiceType. They had the same basic characteristics: disjointed speech, 60 to 70 words per minute, i80486 processor, 16 MB RAM, dedicated or mass-market audio card, price lower than 1,000 ECU.

At the beginning of 1997, at the time of the preparation of the proposal of the VOICE Project for the Telematics Applications Call, there were just a few products for general use, with some limits in their functionality. The most common products, which operated on 486 or Pentium PCs, were those produced by IBM and Dragon Systems; they were available in various languages, including English, French, Spanish, German and Italian. No other packages had been released with the same characteristics. The linguistic aspect of the software packages had to be considered because of the European dimension of the project.

Dragon Dictate allowed a user to dictate up to 60 words per minute, after training the system for a few hours. It also let the user guide the mouse pointer across the screen by means of vocal commands, and could alter the last recognised word if it had not been recognised correctly. IBM VoiceType Dictation allowed a user to dictate up to 60 words per minute and worked in groups of three words at a time, displayed as highlighted text prior to confirmation. A dictionary and a series of probabilities (that one word should be followed by another) were used to check that recognised words had been correctly understood. One of the difficulties encountered in the use of these systems was the still insufficient quality of the recognition. This was due in part to the need for the speaker to insert short pauses between words (disjointed speech) and to dictate punctuation marks.
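
A minimal sketch in Python of that word-pair check, as we understand it from the packages' behaviour (the table and figures are invented for illustration, not taken from IBM's or Dragon's implementations):

    # A tiny table of probabilities that one word follows another; the
    # recogniser prefers the candidate phrase with the higher score.
    bigram = {("voice", "recognition"): 0.12, ("voice", "wreck"): 0.001}

    def score(words, default=1e-6):
        p = 1.0
        for pair in zip(words, words[1:]):
            p *= bigram.get(pair, default)  # unseen pairs get a small default
        return p

    print(score(["voice", "recognition"]) > score(["voice", "wreck"]))  # True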

In July 1997 IBM released in Italy the Med-Speak package for continuous speech recognition; at that time it only contained a dictionary for radiologists. Dragon Systems was also releasing the NaturallySpeaking package for continuous speech recognition. The main characteristics of both systems are continuous speech, dictation speed greater than 100 words per minute, high accuracy (greater than 95%) and large dictionaries. The speech recognition engines of both packages no longer depend on pauses between words. This will have a great impact on the quality of voice recognition during conferences, and more particularly across telephone lines, since the problems caused by the background noise of the signal during pauses will no longer be present.

3. The VOICE Laboratory

From the beginning of 1996, a multimedia VOICE Laboratory prototype was installed at JRC-ISIS and FBL started working on the Exploratory Research VOICE Project of JRC-ISIS. The research was based on two main applications: the integration of voice to text recognition software into a subtitling system for meetings, conferences, lectures and television; and a system whereby voice to text recognition of conversations across telephone lines could allow a deaf person to be contacted by someone from an ordinary telephone. The functions of the prototype demonstrator are described hereafter.

In the second half of 1996, a prototype demonstrator with just the necessary basic functions was developed and presented to the users, stimulating their interest and providing more precise feedback on their needs. The prototype made use of the IBM VoiceType packages in Italian and English, with a piece of software developed by FBL for a more user-friendly presentation on the screen. Both IBM VoiceType 3.0 and Dragon Dictate were installed and tested, and VoiceType was selected for its user-friendliness and its accuracy both before and after training. High accuracy in both situations matters because subtitling needs accurate recognition of a specific trained speaker, while the telephone application needs a package that can recognise almost anyone's voice.

In July 1997 we performed some tests with the new beta release of continuous speech recognition for radiologists, and in December 1997 we started tests on the new releases of both the IBM and Dragon continuous speech recognition packages. We improved the prototype demonstrators for the different applications foreseen in the VOICE Project. The introduction of continuous dictation greatly increased the effectiveness and potential of voice to text recognition, so that in the first quarter of 1998 we could hold several presentations of the prototype demonstrator to Associations of the hearing impaired and to schools. The suggestions received from them are helping us bring the system nearer to the user requirements.

Besides the specific equipment related to voice to text recognition, the VOICE Laboratory hosts the VOICE Project's Web site, which is accessible at all times across the Internet. The aim of the site is to help overcome the communication problems of the deaf, demonstrating the usefulness of the Internet as a very appropriate means of contacting others and gathering information. A VOICE Discussion Forum and a VOICE Chat Line will also be provided, as a means of collecting and spreading information on the ongoing activities of the Project and, more generally, on voice to text recognition developments, facilities in tele-education programmes and technical aids for the deaf.

3.1. Subtitling conversations, school lessons and conferences

The text generated by commercial voice to text recognition packages may help a deaf person during a normal conversation, but it is displayed in complicated windows and organised more in the style of a word processor than of a subtitling programme. Following comments and suggestions from JRC-ISIS and users, FBL developed a piece of software that makes this basic use of voice to text recognition more user-friendly. It puts the generated text on the screen in dimensions and colours that the user can set, and only displays the part of the recognition output which is of interest. This function of the demonstrator turns a voice to text recognition package into a subtitling system, i.e. one with a given number of lines, of a given length and in a given style.
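
A minimal sketch in Python of the line-breaking rule just described (an illustration of the idea, not FBL's actual code):

    def make_subtitle_lines(text, max_chars=40):
        # Break recognised text into subtitle lines of at most max_chars
        # characters, without ever dividing a word between two lines.
        lines, current = [], ""
        for word in text.split():
            if current and len(current) + 1 + len(word) > max_chars:
                lines.append(current)
                current = word
            else:
                current = (current + " " + word) if current else word
        if current:
            lines.append(current)
        return lines

    print(make_subtitle_lines("automatic recognition of speech could be "
                              "a powerful help for the deaf", max_chars=24))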

A first version of this program was created in the second semester of 1996 and demonstrated to the users in the first semester of 1997. At that time the system was based on the idea that a dictating interpreter would still dictate the text. This was foreseen with the aim of starting the activities as soon as possible, using a prototype to gain experience and develop user awareness without waiting for new, improved releases of the commercial packages. We used the IBM VoiceType 3.0 package, which also offered a function called VoiceType Direct, allowing the user to select a text window into which to dictate; this made it possible to overcome some obstacles in integrating the programs.

The operating schema at that time was the following. The speaker, or in some cases a dictating interpreter, spoke into a microphone headset attached to the PC. The voice to text recognition package converted the spoken message into text, which was displayed on the PC screen and also converted into convenient lines of subtitles that were passed via a network to a second PC. There the subtitle files were loaded at intervals and displayed on a black screen. The signal from this second PC to its monitor passed through a piece of video overlaying hardware, which superimposed the subtitles onto a video source (in most cases the image of the speaker taken by a video camera), converting the black subtitle background into transparency. The final result was a composite video signal of the subtitled video source that could be viewed on a television set or recorded on a video recorder.
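
The passage of the subtitle lines from the recognition PC to the display PC can be sketched as follows in Python (the host name and port are our own illustrative choices; the original prototype loaded subtitle files over the network rather than a socket stream):

    import socket

    def send_subtitle(line, host="display-pc", port=5005):
        # Recognition PC: push one finished subtitle line to the display PC.
        with socket.create_connection((host, port)) as conn:
            conn.sendall(line.encode("utf-8") + b"\n")

    def run_display(port=5005):
        # Display PC: show each received line (a stand-in for the black-screen
        # window whose video signal goes through the overlay hardware).
        with socket.socket() as srv:
            srv.bind(("", port))
            srv.listen(1)
            conn, _ = srv.accept()
            with conn, conn.makefile("r", encoding="utf-8") as lines:
                for line in lines:
                    print(line.rstrip())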

This prototype demonstrator has been working at JRC-ISIS since February 1997 and was installed at ALFA in Milan in April 1997, for testing and use at the Association's meetings. After discussions with the users, and tests and validation of the prototype, several improvements have been developed. The experience acquired helped us in further tests on the new releases of the IBM and Dragon packages recognising continuous speech. The VOICE Laboratory is being enlarged with new releases of the voice to text packages, as well as with new pieces of software, new PC interfaces and video signal mixing systems.

Since December 1997 (when the first draft of the present paper was submitted), the workstation configuration of the prototype demonstrator has been based on a Pentium 200 MMX PC with 64 MB RAM, a CD-ROM drive, a Creative SoundBlaster AWE 64 audio card and a 17" monitor. The software is the Windows 95 operating system, Dragon NaturallySpeaking or IBM ViaVoice, and additional software developed by FBL. In order to take the input of a video camera and send the final output to a video recorder, a 4 MB Matrox video card with Rainbow Runner must also be installed. The Matrox card takes the input of a video camera, while the Rainbow Runner processor, used in addition to the Matrox card, relieves the CPU of part of the screen management.

The PC inputs are images and sounds. The images taken by the video camera are displayed on the screen, using the video card's internal processor. The sound, i.e. the voice of the speaker, is acquired by the sound card and then analysed by the voice to text recognition package. The output is sent to the application program developed by FBL, which manages the number of rows: either fixed according to specific requirements, or defined automatically, controlled by the speaker's pauses. The text is displayed in a textbox on a coloured background at the bottom of the screen, organised so that words are never divided between two lines. The user can alter the font and size of the recognised text, the number of characters on each line and the number of lines of subtitles, as well as the colour of the background immediately around the subtitles. The generated text can also be saved and filed for future reference or use, such as printing reports of conferences and minutes of meetings. This aspect, developed for the needs of the deaf, is of particular interest to hearing persons as well.
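
The rolling textbox can be sketched like this in Python (names and defaults are illustrative assumptions, not FBL's code):

    from collections import deque

    class SubtitleBox:
        # Shows only the most recent subtitle rows, while keeping the full
        # transcript for printing reports or minutes afterwards.
        def __init__(self, max_lines=3):
            self.visible = deque(maxlen=max_lines)  # old rows scroll away
            self.transcript = []

        def add_lines(self, lines):
            for line in lines:  # e.g. from make_subtitle_lines above
                self.visible.append(line)
                self.transcript.append(line)

        def render(self):
            return "\n".join(self.visible)

        def save(self, path):
            with open(path, "w", encoding="utf-8") as f:
                f.write("\n".join(self.transcript))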

The complete demonstrator set uses a video camera and a wall projector, which are useful for conferences or television broadcasts. In the classroom or at home the system may be used without this additional equipment, and the Matrox and Rainbow Runner cards need not be installed. We have also performed tests on the use of a portable PC, which gives good performance provided that a Creative SoundBlaster card is installed. We have used wireless microphones too, taking some additional care in setting the signal's input level.

3.2. Voice and pauses handling

The vocabularies included in the commercial voice recognition packages are of a general-purpose type, but easy to personalise. The packages are partially speaker-independent, but in order to achieve higher accuracy the speaker must train the package (this operation requires about an hour and is done once). It is then important to check the dictionary against the kind of text that will be dictated most frequently. The package detects the words not included in the dictionary and asks the speaker to type and dictate them, so as to add them to the dictionary. As an alternative, the system may process the text (if already available in a file) in batch mode, discover all the new words and ask the user to dictate them. The user then has to train himself in managing the product, balancing the pauses so as to handle short sentences and avoid breaking the speech where not necessary, in order to get the best results.
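
The batch dictionary check can be sketched as follows in Python (the file names and word pattern are illustrative assumptions, not part of the commercial packages):

    import re

    def new_words(text_path, dictionary_path):
        # List the words of a prepared text that are not yet in the user's
        # dictionary, so they can be typed and dictated once.
        with open(dictionary_path, encoding="utf-8") as f:
            known = {w.strip().lower() for w in f if w.strip()}
        with open(text_path, encoding="utf-8") as f:
            words = re.findall(r"[a-zà-ÿ']+", f.read().lower())
        return sorted(set(words) - known)

    for word in new_words("lecture.txt", "user_dictionary.txt"):
        print(word)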

We have often discussed a particularly important aspect with the users. Since lip-reading leaves some uncertainty in interpreting the words, and the speech recognition system may likewise produce subtitling lines with some errors, the combination of both can help users get as near as possible to the originally spoken text. When our speech is addressed to an audience with a large number of deaf participants, or when they are just a few but we consider it a priority that they get the best comprehension of what is being said, we use the prototype just as any other working tool.

So we are not concerned with a nice presentation effect: we pronounce short phrases (for instance 3 to 6 words) that the participants may lip-read on the image of our face, taken by a video camera and projected on the wall screen. Then we make a short pause (greater than 250 ms). The application program recognises this pause as a command to complete the recognition of what has been said (this takes approximately another 250 ms) and shows the generated text as subtitling lines on the wall screen. After reading the lines, we may, if necessary, repeat just one or two words that have not been recognised correctly. Otherwise we may continue the speech or add some details, if we consider it useful to repeat a word or use a synonym, either because some words have not been recognised by the system or because the audience seems unfamiliar with particular words. If necessary, we may also type some words on the keyboard.
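
A sketch in Python of this pause convention, under the assumption that the recogniser supplies each word with start and end times (the 250 ms figure is the one quoted above):

    PAUSE_MS = 250

    def phrases(timed_words):
        # timed_words: iterable of (word, start_ms, end_ms) tuples.
        # A silence longer than PAUSE_MS closes the current phrase and
        # releases it to the subtitling display.
        current, last_end = [], None
        for word, start, end in timed_words:
            if current and start - last_end > PAUSE_MS:
                yield " ".join(current)
                current = []
            current.append(word)
            last_end = end
        if current:
            yield " ".join(current)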

3.3. Subtitling television transmissions

This part of the demonstrator's development would make it possible to subtitle a video source using voice to text recognition. It concentrates on testing various ways of displaying the subtitles and video signals. We developed a prototype, and since March 1997 we have been able to input a signal from a video recorder (or television aerial) into the PC, visualise the image on the screen and create subtitles using voice recognition software. The accuracy of recognition performed directly on the signal was insufficient, but could be improved through the use of a dictating interpreter. Further tests are foreseen, and we are also gaining experience with a digital voice recorder as input to the recognition software.

Since the subtitling of television transmissions is the result of manual preparation of files to be transmitted in Teletext format, most subtitled transmissions are films. Subtitling of the news and of live programmes, even those addressed to the deaf, is rarely performed. Voice to text recognition packages could help in subtitling films, speeding up the subtitling operations and probably reducing costs. This might allow more subtitled films to be transmitted by television broadcasters. Nevertheless it will not dramatically change the frustrating isolation felt by the deaf community, which is looking for subtitling of live programmes and of the news. For this more important aspect, the broadcasting companies might use voice recognition systems and produce good results, but they would have to accept the risk of limited accuracy in recognising the speakers' voices, or bear the cost of training the speakers in the use of the new systems and of building and updating personalised dictionaries.

Another approach could be considered too. Instead of using Teletext technology for broadcasting the generated subtitling lines, for specific transmissions the subtitles may be made available through the Internet. Subtitle lines sent across the Internet can still be refreshed at a high rate, and the Web pages need only contain a few images. Such a program could also be used for subtitling radio broadcasts. The subtitles do not necessarily have to be created by the broadcasting companies themselves: independent members of the public, with the correct equipment and programs, could listen to the radio or television, summarise what is being said into a microphone, and the subtitles would be broadcast world-wide over the Internet.
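
As a sketch of how such an Internet service might look (our own assumption, written with a present-day Python library rather than the tools of the Project): a tiny Web server keeps the latest subtitle lines in memory and serves a page that refreshes itself every second.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    latest_lines = ["(waiting for subtitles)"]  # updated by the recognition side

    class SubtitleHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = ("<html><head><meta http-equiv='refresh' content='1'>"
                    "</head><body><p>" + "<br>".join(latest_lines) +
                    "</p></body></html>")
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    HTTPServer(("", 8080), SubtitleHandler).serve_forever()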

3.4. On the telephone line

This function of the demonstrator involves tests on a laboratory prototype of a computer-driven telephone converting voice to text: a person speaks down the phone line, the message is passed into a PC at the deaf person's end, and the words are visualised on the deaf person's screen. In this situation only the deaf person needs the appropriate equipment, while the other speaker may use any ordinary phone.

As for the subtitling application, the configuration for the telephone application is based on a Pentium 200 MMX PC with 64 MB RAM, a CD-ROM drive, a Creative SoundBlaster AWE 64 audio card and a 17" monitor. The video camera and the Matrox card are not necessary in this case, while a telephone line and a telephone set must of course be available. A filter is needed to decrease the noise on the line and provide a better input signal. The software is the Windows 95 operating system, Dragon NaturallySpeaking or IBM ViaVoice (used for voice recognition), Dragon Dictate for Windows V. 3.0 (used for managing the PC menus) and additional software developed by FBL.

The application will also include a text to speech system developed by MasterSoft to allow the deaf person to reply (should he/she have difficulties in speaking); this may also be useful in providing the person at the dictating end with feedback on whether what was said has been recognised correctly. It is worth noting that when a hearing person speaks at one end of a telephone line and the text is shown on the deaf person's PC screen at the other end, the first user is blind with regard to the screen. Feedback of the recognised text is needed, just as a blind person needs text to speech synthesis to read a document from a PC. In this respect, working for the deaf will also benefit the blind.
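
The conversational loop can be outlined as follows (recognise, speak and screen are hypothetical placeholders standing in for the recognition engine, the MasterSoft text to speech system and the deaf user's display; none of these names come from the actual software):

    def telephone_turn(audio_from_line, recognise, speak, screen):
        text = recognise(audio_from_line)  # the hearing caller's words as text
        screen.show(text)                  # the deaf user reads the screen
        speak(text)                        # the caller hears what was recognised
                                           # and can repeat misrecognised words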

The development of the prototype will, as a first step, improve the filters on the telephone line, provide an application program to manage the conversation between the workstation and the people calling it, and write the message on the screen. Experiments will also be carried out to see whether reducing the number of words in the software's vocabulary improves the recognition accuracy.

The technology behind recent releases of voice to text recognition software will reduce the problems caused by the background noise present on telephone signals. The recognition engines of previous releases required the dictated words to be disjointed, and noise levels on signals taken from a phone line were so high during these necessary pauses between words that the quality of recognition was very poor. The latest releases from both IBM and Dragon Systems use continuous dictation and no longer rely on periods of silence to separate words. Tests carried out recently (October 1997) using a beta release of the IBM Med-Speak software for radiologists have given encouraging results. Owing to the limitations of bandwidth, as well as the noise on telephone lines, tests will be carried out on signals from various lines (ISDN, GSM).

3.5. Costs

For the final user, the cost of each demonstrator may be considerably lower than expected, since some commercial voice to text recognition packages are presently being sold at about 100 ECU (10% of the original price), including an active microphone whose individual cost is more than half that of the complete package.

The approximate cost of the basic structure on which the applications are based (a Pentium PC with a SoundBlaster 16 sound card and a CD-ROM drive) is 1,500 ECU, or 2,500 ECU for users demanding particular applications (video mixing, etc.). This price nevertheless buys a fully operational PC that can be used in many other useful ways. A further great advantage is that the system does not depend on any particular company or software release. The equipment at the application sites (the FBL laboratory and the JRC VOICE Laboratory) is however slightly more expensive: the added costs are due to increased network capabilities and to additional pieces of hardware and software used for everyday work as well as for comparisons, testing and demonstration purposes.

4. Results and final goal

Following a well-tested procedure, each step of the prototype is developed by FBL in close collaboration with the JRC-ISIS VOICE Laboratory, then presented to the users and discussed and tested with them. Until now, when the opportunity has arisen, the demonstrator has been presented at conferences or general assemblies of the Associations of the deaf, to raise awareness of the possibilities of the system. In the first quarter of 1998, the system was presented to some schools, where it will be tested in real situations of use, with several constraints and specific difficulties. It will be used for subtitling school lessons (for the benefit of deaf students), as well as for visualising the dialogue of foreign language lessons (for the benefit of hearing students) or of lessons on the host country's language (for the benefit of any user, particularly immigrants). Some tests are also foreseen for subtitling university lessons and printing summaries.

In the coming months, we will use the prototype demonstrator for subtitling the speech of some speakers at conferences, an important opportunity for testing and validating the demonstrator in different operating conditions. During the ICCHP-98 Conference we will present the system to users in an international environment and test it in English, Italian and German, in real operating conditions. Users will be encouraged to give their comments and suggestions in a VOICE Workshop and in the VOICE Special Interest User Group, which will hold its first meeting during ICCHP-98. The suggestions will help in improving the prototype, in order to show further functions to the hearing impaired, the producers of speech recognition systems and the service providers.

This development is foreseen with the aim of discussing possible new functions as soon as possible, gaining experience and developing user awareness without waiting for new, improved releases of the commercial packages. But, as stated in a previous paper (The VOICE Project - Part 1) presented at ICCHP-98, the final aim of the prototype demonstrator is not, as in many other projects, to increase its size and performance. The aim is just … to disappear! We hope that during the two years of the Project's life we will be able to help the hearing impaired convince the producers of speech recognition systems that it is in their interest to include some of the proposed basic functions in the new releases of their standard products, for the benefit of any user.

 

