Features in mobile guides: audio

When we started the Blue Lion project, we looked at what other companies had done so far to deploy tourism services and content on mobile phones. We were struck by how many features were built into some of the mobile guides. We will review some of these features in a series of articles. Today we look at audio.

Providing audio to visitors is close to essential: smartphones and tablets are portable devices, just like the audio guides used in museums. You do not want to read text while looking at a painting or a monument. Audio is therefore a necessity, especially when you have good-quality content that is long enough to explain things thoroughly. If your text is just a paragraph or two, audio is probably not worth including in your solution.

Natural voices
Now, how to provide audio? The best option is a natural voice, which means having real people (paid or volunteer) read the text. While the results are the best you can get, there are some drawbacks. Unless you reduce the quality of the recording (and risk losing the advantage of using a natural voice in the first place), the audio files (most often mp3) may become too large for potential customers to download your guide. Downloading may even become impossible over a 3G connection, given the size limits (20-30 MB) that the main platform providers, including iOS and Android, place on apps downloaded over a cellular connection. A natural voice can also mean higher costs, because you need to hire actors and a recording studio and to manage a large number of files in the original and translated versions. Consider also the loss of quality any audio suffers when played through a smartphone's tiny loudspeaker.
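To get a feel for the size problem, the arithmetic can be sketched as follows. This is a rough estimate assuming constant-bitrate mp3 files. The one hour of narration and the bitrates are illustrative assumptions, not measurements from a real guide, and the 20 MB default is simply the lower end of the limit mentioned above.

```python
def mp3_size_mb(duration_minutes: float, bitrate_kbps: int) -> float:
    """Approximate size of a constant-bitrate mp3 in megabytes (10**6 bytes)."""
    bits = bitrate_kbps * 1000 * duration_minutes * 60  # kbit/s -> total bits
    return bits / 8 / 1_000_000                         # bits -> bytes -> MB


def fits_cellular_limit(duration_minutes: float, bitrate_kbps: int,
                        limit_mb: float = 20.0) -> bool:
    """True if the audio alone stays within the assumed download limit."""
    return mp3_size_mb(duration_minutes, bitrate_kbps) <= limit_mb


if __name__ == "__main__":
    # Hypothetical guide with one hour of narration at three bitrates.
    for kbps in (128, 64, 32):
        size = mp3_size_mb(60, kbps)
        verdict = "fits" if fits_cellular_limit(60, kbps) else "too large"
        print(f"{kbps} kbps: {size:.1f} MB ({verdict})")
```

Under these assumptions, an hour of narration only fits under 20 MB at a bitrate low enough to noticeably degrade the recording, and that is before counting images, maps and the app itself.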

Text-to-speech or Voice Synthesizer
The other option is to use voice synthesis, or text-to-speech (TTS) software. There are essentially two ways to work with voice synthesizers: one is to generate the audio files in advance and add them to the app, and the other is to embed the synthesizer directly in the app, possibly as a premium feature. The first solution doesn't necessarily eliminate the issue of handling large mp3 files, although synthesized voices may compress more easily.

The second solution consists of embedding the software directly in the app. This means that text is synthesized on the fly and that any update to the text is automatically read by the software. In this case users have to download the "voices" that read the text. These are "heavy" files too: the footprint per voice can be between 16 and 150 MB, depending on the quality of the voice. This solution is best suited to dynamic content, that is, content created locally on the device or fetched by the device from other sources. Whether customers will want to download such a file for just one or two walks remains to be seen.

The quality of text-to-speech solutions has improved dramatically in the past few years, to the point that the higher-quality solutions and voices are very difficult to distinguish from natural voices, especially when listening on portable devices.

One former weakness of TTS solutions, namely the rendering of foreign words by a voice in another language, is now handled with phonemes, which are either written by hand or, in some cases, generated automatically by the software. For instance, an English voice can be made to pronounce Jean-Baptiste Colbert as [ʒɑ̃ batist kolbɛʁ].
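One common way to hand such phonemes to a synthesizer is the `<phoneme>` element of SSML, the W3C speech synthesis markup language. The sketch below shows the idea only: the helper name `with_phonemes` and the sample sentence are made up for illustration, and whether a given engine among those listed below accepts SSML phoneme tags is an assumption to check against its documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def with_phonemes(text: str, pronunciations: dict) -> str:
    """Wrap each known term in an SSML <phoneme> element carrying its IPA.

    Hypothetical helper: a simple string replacement that assumes the
    terms do not overlap each other.
    """
    out = escape(text)  # make the plain text XML-safe first
    for term, ipa in pronunciations.items():
        tag = (f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
               f'{escape(term)}</phoneme>')
        out = out.replace(escape(term), tag)
    return out


if __name__ == "__main__":
    # IPA taken from the example in the text above.
    ipa = {"Jean-Baptiste Colbert": "ʒɑ̃ batist kolbɛʁ"}
    print(with_phonemes("The finances were run by Jean-Baptiste Colbert.", ipa))
```

The resulting fragment would then be placed inside a `<speak>` document before being passed to the synthesizer.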

Some of the major providers of voice synthesizing software include:

- AT&T Natural Voices, provided through Wizzard
- Acapela for iPhone
- Nuance Vocalizer
- Ivona Text to Speech
- Cereproc
- Odiogo
- Loquendo (recently purchased by Nuance)

There is no optimal solution for including audio in mobile guides. The best quality can be achieved with natural voices, but this implies higher costs and requires users to download large audio files for each guide. Voice synthesizers may be less expensive, but the quality of the audio output varies with the provider and the voice.