Redial: Interactive Telephony : Week 9

Text to Speech in Asterisk (Festival)

Sorry Dave Concept to Speech
It does not compute
- Hopefully we won't hear these too often!

Speech Synthesis

Artificial production of human speech

Historically, attempts have been mechanical in nature: tubes, simulated vocal cords and so on. Instruments that simulate speech have also been attempted. Unfortunately, these devices and instruments don't have the flexibility of human articulators (such as lips, tongue and teeth).

Vocoder - Early voice synthesis, used as an instrument

More recent attempts at speech synthesis have been done with computers.

Two characteristics are used to judge the quality of synthesized speech: naturalness and intelligibility. Naturalness is how much it sounds like a human and intelligibility is how easily it can be understood.

3 main techniques:

Database of phones or diphones (concatenative synthesis) - Flexible, able to produce a wide variety of words. Not terribly easy to understand (intelligible) but can be somewhat natural sounding. Has "glitches" at the points where the recorded pieces of speech are joined.

Limited Domain - Recordings of entire words, phrases and perhaps even sentences are stored for a specific use, such as telling time, pronouncing the alphabet or reading numbers. This is very easy to understand and natural sounding but not very flexible.

Mathematical models (formant synthesis) - Shows great promise; expensive but can be very convincing. Highly intelligible but not very natural.
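
As a rough illustration of the limited domain approach, the sketch below maps a time of day to an ordered list of pre-recorded prompts that would be played back to back. The prompt names (such as `the-time-is` and `oclock`) and the function names are hypothetical; they are not files or APIs shipped with Asterisk or Festival.

```python
# Limited-domain sketch: a "talking clock" that can only say times,
# assembled from a small, fixed set of recorded words.
# Every prompt name below stands for a hypothetical recording.

ONES = ["oh", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def minute_prompts(minute):
    """Return the recordings needed to say the minutes part."""
    if minute == 0:
        return ["oclock"]
    if minute < 10:
        return ["oh", ONES[minute]]
    if minute < 20:
        return [ONES[minute]]
    tens, ones = divmod(minute, 10)
    return [TENS[tens]] + ([ONES[ones]] if ones else [])


def time_prompts(hour, minute):
    """Map a 24-hour time to the ordered list of recordings to play."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("time out of range")
    hour12 = hour % 12 or 12          # 0 and 12 are both spoken as "twelve"
    suffix = "am" if hour < 12 else "pm"
    return ["the-time-is", ONES[hour12]] + minute_prompts(minute) + [suffix]


if __name__ == "__main__":
    print(time_prompts(14, 5))
```

The trade-off described above shows up directly: every word is a clean human recording, but the system cannot say anything that is not in its word lists.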

Some key words and concepts:

Phonology - The study of the sound system of a language (abstract)

Phonetics - The physical production and perception of sounds that comprise speech (concrete).

Phone - A portion of speech that has a distinct physical or perceptual property (concrete).

Phoneme - The abstract representation of a sound (abstract).

Prosody - A term referring to elements of speech such as intonation, pitch, rate, loudness and rhythm.

Diphone - A pair of phones spoken together. Diphones are used in speech synthesis to create sounds that are more natural sounding than combining phones directly. The transition between two phones differs depending on which phones are involved, and diphones capture those transitions.
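
To make the diphone idea concrete, here is a minimal sketch of how a phone sequence could be turned into the list of diphone units a concatenative synthesizer would look up. The phone labels, the `pau` silence label and the function name are illustrative assumptions, not the naming of any particular diphone database.

```python
def to_diphones(phones, silence="pau"):
    """Turn a phone sequence into the diphone units covering it.

    Silence is added at both ends so that the first and last phones
    also get a transition (silence-to-phone and phone-to-silence).
    """
    padded = [silence] + list(phones) + [silence]
    # Each diphone spans one phone-to-phone transition.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # A rough phone sequence for "hello" (labels are illustrative).
    print(to_diphones(["hh", "ax", "l", "ow"]))
```

A word of n phones needs n + 1 diphones, and a language with a few dozen phones needs on the order of a thousand or more recorded diphones to cover every possible transition.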

More Information:
  • Speech Synthesis - Wikipedia
  • Phonology - Wikipedia
  • Phonetics - Wikipedia
  • International Phonetic Alphabet - Wikipedia
  • Written Language

    In order to perform text to speech, computers need to be able to turn our written language into something that can be spoken.

    This requires not only turning the written language into phones or sounds but understanding punctuation, timing, intonation and focus.

    From "The Talking Computer - Text to Speech Synthesis":

    HAL: I enjoy working with people.

    He could stress any word in the sentence and change its meaning. If he stresses I he contrasts the meaning with "you enjoy ..." If he stresses enjoy, he implies a contrast with "I hate ..." When working is stressed, it means "rather than playing." To convey the meaning of a message the computer must assign a prominent stress to the correct word.

    As you can see, turning text into intelligent speech is no easy task, even if we could make the computer sound natural and intelligible.
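
Even before stress and intonation come into play, written text has to be expanded into words that can actually be spoken. The sketch below is a deliberately naive version of that step: it spells out digit strings one digit at a time and expands a few abbreviations. The abbreviation table and function names are made up for illustration; real systems such as Festival do far more than this.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Hypothetical, tiny abbreviation table. Note that "st." is ambiguous
# ("street" or "saint"); this table simply picks one.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}


def spell_digits(match):
    """Replace a run of digits with its digits read out one by one."""
    return " ".join(DIGITS[int(d)] for d in match.group())


def normalize(text):
    """Expand abbreviations and digit strings into speakable words."""
    words = []
    for token in text.split():
        lowered = token.lower()
        if lowered in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lowered])
        else:
            words.append(re.sub(r"[0-9]+", spell_digits, token))
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("Call Dr. Smith at 2787"))
```

The weaknesses are the point: "St. Mark" would come out as "street Mark", and "2787" is read as a phone number even when it is a quantity. Choosing the right reading requires understanding the context, which is what makes this hard.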

    Do we want computers to talk? Was it good that HAL talked?

    More Information:
  • The Talking Computer - Text to Speech Synthesis
  • Smithsonian Speech Synthesis History Project

  • Systems

    There are many different speech synthesis engines and databases of phones available to researchers. There are many more that are commercial products.

    Here are some that I think you will find interesting:

    The Festival Speech Synthesis System - What we will be using with Asterisk. (Open Source)

    MBROLA - Diphone Databases (of primary interest to us when using Festival, Not Open Source) | A list of Various Software that uses MBROLA.


    Java FreeTTS - Open Source

    Open Mary - Java based, uses XML and has rich prosodic capabilities (emotional speech).

    AT&T Natural Voices Text to Speech | Research Site | Demo

    Festival for use with Asterisk

    Asterisk has a handy dandy command for working with a Festival server:

    		 Festival('Hello World, I am a talking computer!')   ; quotes are important
    		exten => s,1,Festival('Hello World, I am a talking phone system')

    Unfortunately, the Festival command for Asterisk doesn't give us much flexibility for determining the voice to be used or other timing elements.

    Festival uses the Scheme programming language to define its configuration. I don't pretend to understand it, but here is an example from a configuration file:
    			(set! after_synth_hooks
    				(lambda (utt)
    				   (lambda (x)
    					 (format t "%s %s\n" (item.feat x 'segment_end) ( x)))
    					(utt.relation.items utt 'Segment))
    				   (utt.wave.rescale utt 2.6)))
    			text2wave -eval '(voice_kal_diphone)'


    Fortunately for us, we can gain a little bit of this power by using the text2wave system command instead of using the Festival command directly.

    Here are our options:
    		[sve204@social festival]$ text2wave -?
    		text2wave [options] textfile
    		  Convert a textfile to a waveform
    		  -mode     Explicit tts mode.
    		  -o ofile  File to save waveform (default is stdout).
    		  -otype    Output waveform type: ulaw, snd, aiff, riff, nist etc.
    		            (default is riff)
    		  -F        Output frequency.
    		  -scale    Volume factor
    		  -eval     File or lisp s-expression to be evaluated before

    The easiest way to use this in Asterisk is like so:
    			exten => s,1,System(echo 'Hello, I am a phone not a person' | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/tester.wav);
    			exten => s,2,Background(/home/sve204/tester);

    We can also pass in arguments to use different voices:
    			echo 'Hello World' | /usr/bin/text2wave -F 8000 -o /home/sve204/tester2.wav -eval "(voice_us1_mbrola)"
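
If the text to be spoken comes from a script rather than being typed into the dialplan, it can be easier to call text2wave from a program. The following is a minimal sketch, assuming text2wave lives at /usr/bin/text2wave and reads its text from standard input as in the echo examples above. The helper names `text2wave_args` and `synthesize` are my own, and only the options already shown (-F, -scale, -o, -eval) are used.

```python
import subprocess


def text2wave_args(out_path, freq=8000, scale=None, voice=None,
                   binary="/usr/bin/text2wave"):
    """Build the text2wave argument list from the options shown above."""
    args = [binary, "-F", str(freq), "-o", out_path]
    if scale is not None:
        args += ["-scale", str(scale)]
    if voice is not None:
        # A voice is selected by evaluating a Scheme call,
        # for example (voice_us1_mbrola).
        args += ["-eval", f"({voice})"]
    return args


def synthesize(text, out_path, **options):
    """Pipe text to text2wave on stdin, like `echo ... | text2wave`."""
    subprocess.run(text2wave_args(out_path, **options),
                   input=text.encode("utf-8"), check=True)


if __name__ == "__main__":
    # Only prints the command; call synthesize() on a machine with Festival.
    print(text2wave_args("/home/sve204/tester2.wav", voice="voice_us1_mbrola"))
```

Passing the arguments as a list avoids shell quoting problems when the text contains apostrophes or other special characters, which is easy to trip over with echo inside a System() call.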

    This gives us a bit more flexibility but what if we want more more more control?

    Fortunately, there is an XML spec called SABLE:

    SABLE: A Synthesis Markup Language (version 1.0)

    With SABLE you can create a text file and pass that to text2wave. Here is a sample:
    			<?xml version="1.0"?>
    			<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
    			      "Sable.v0_2.dtd" []>
    			<SABLE>
    			<SPEAKER NAME="kal_diphone">
    			The boy saw the girl in the park <BREAK/> with the telescope.
    			The boy saw the girl <BREAK/> in the park with the telescope.
    			Good morning <BREAK /> My name is Stuart, which is spelled
    			<RATE SPEED="-40%">
    			<SAYAS MODE="literal">stuart</SAYAS> </RATE>
    			though some people pronounce it 
    			<PRON SUB="stoo art">stuart</PRON>.  My telephone number
    			is <SAYAS MODE="literal">2787</SAYAS>.
    			I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place, 
    			but no one can pronounce that.
    			</SPEAKER>
    			</SABLE>

    Another Example:

    			<?xml version="1.0"?>
    			<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
    			      "Sable.v0_2.dtd" []>
    			<SABLE>
    			<SPEAKER NAME="kal_diphone">
    			Good evening class <BREAK />
    			How are you all doing?
    			My name is Shawn.
    			Should I <VOLUME LEVEL="loud">yell loudly</VOLUME>
    			or should I speak <VOLUME LEVEL="quiet">in a quiet voice</VOLUME>?
    			Should I <RATE SPEED="+100%">speak in a fast voice</RATE> or
    			should I <RATE SPEED="-50%">speak in a slow manner</RATE>?
    			</SPEAKER>
    			</SABLE>

    If the above were saved as a text file named test.sable, we would create a wav file with the text2wave command as follows:

    		/usr/bin/text2wave -F 8000 -o /home/sve204/testsable.wav /home/sve204/test.sable
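
When the text is generated on the fly, the SABLE file can be generated too. Below is a small sketch that builds a SABLE document as a string, escaping any characters that would otherwise break the XML. The helper names are my own, only the tags used in the samples above appear, and the DTD file name `Sable.v0_2.dtd` is an assumption based on the example in the Festival manual.

```python
from xml.sax.saxutils import escape, quoteattr

# Header as used in the samples; the DTD file name is an assumption.
SABLE_HEADER = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"\n'
    '      "Sable.v0_2.dtd" []>\n'
)


def plain(text):
    """Escape plain text so &, < and > cannot break the markup."""
    return escape(text)


def rate(text, speed):
    """Wrap text in a RATE tag, e.g. speed="+100%" or "-50%"."""
    return f"<RATE SPEED={quoteattr(speed)}>{escape(text)}</RATE>"


def volume(text, level):
    """Wrap text in a VOLUME tag, e.g. level="loud" or "quiet"."""
    return f"<VOLUME LEVEL={quoteattr(level)}>{escape(text)}</VOLUME>"


def sable_document(parts, speaker="kal_diphone"):
    """Join already-built fragments into a complete SABLE document."""
    body = "\n".join(parts)
    return (SABLE_HEADER
            + "<SABLE>\n"
            + f"<SPEAKER NAME={quoteattr(speaker)}>\n"
            + body
            + "\n</SPEAKER>\n</SABLE>\n")


if __name__ == "__main__":
    print(sable_document([
        plain("Good evening class"),
        "<BREAK/>",
        "Should I " + volume("yell loudly", "loud") + "?",
        "Should I " + rate("speak in a fast voice", "+100%") + "?",
    ]))
```

The output can be written to a file such as test.sable and passed to text2wave exactly as in the command above.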

    Here is an article regarding SABLE with Festival: Sable

    Here is the list of supported tags: Supported Tags

    AGI + Web/RSS/XML + Festival

    Just an example: php_rss_example
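
The linked example is written in PHP. As a rough idea of the moving parts, here is a sketch of the same pattern in Python (a swapped-in language, not the linked code): read the AGI environment that Asterisk sends on standard input, pull headlines out of an RSS document, and build an AGI `EXEC Festival` command to speak them. The sample feed and all function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up feed used only for the demonstration at the bottom.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Phones learn to talk</title></item>
<item><title>Festival meets Asterisk</title></item>
</channel></rss>"""


def read_agi_env(stream):
    """Read the 'agi_key: value' lines Asterisk sends, up to a blank line."""
    env = {}
    for line in stream:
        line = line.strip()
        if not line:
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env


def rss_titles(rss_xml, limit=3):
    """Return up to `limit` non-empty item titles from an RSS document."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title", default="").strip()
              for item in root.iter("item")]
    return [t for t in titles if t][:limit]


def headlines_to_text(titles):
    """Turn a list of headlines into one string to be spoken."""
    if not titles:
        return "There are no headlines right now."
    return " ".join(f"Headline {i}. {t}." for i, t in enumerate(titles, 1))


def festival_command(text):
    """Build the AGI command that asks Asterisk to run Festival()."""
    # Double quotes would end the argument early, and commas may be
    # treated as argument separators, so both are dropped here.
    safe = text.replace('"', "").replace(",", "").replace("\n", " ")
    return f'EXEC Festival "{safe}"'


if __name__ == "__main__":
    print(festival_command(headlines_to_text(rss_titles(SAMPLE_RSS))))
```

A real AGI script would call `read_agi_env(sys.stdin)` first, fetch the feed, write the command to standard output, flush it, and then read Asterisk's one-line reply (which starts with `200 result=` on success) from standard input.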

    Building Voices in Festival
    For the really really ambitious:

    Voice Demos