Making a case for expressive speech synthesis

To Whom It May Concern:

As an AAC clinician and researcher, I am writing this letter in support of ensuring that AAC systems continue to move towards integrating speech synthesis technology that allows for expressive speech.

Specifically, there are two key technological advances that need to be integrated into current and future AAC systems:

1 – Custom / personalised voices – this is the capability of having a voice that is consistent with one’s expression of identity. It is important to have a voice that matches one’s gender expression, regional and cultural identity, age, and body mass. Having a unique voice also helps listeners attribute spoken contributions to the AAC speaker.

2 – Expressive speaking range – this is the capability to use and control prosody, intonation, and tone of voice. To be clear, this should not be confused with emotional speech. “Emotional” speech, such as sounding angry or happy, is only a small subset of expressive speech, and research suggests that emotional speech is not nearly as important for communicative competence as having an expressive speaking range. Expressive speech also goes beyond having access to whispering, yelling, or whining voice options.

The first feature – personalised voices – can be achieved with both unit selection and parametric (e.g., HMM-based) speech synthesis technology.

The second feature – expressive speaking range – requires parametric approaches, which only some speech synthesis providers use.

There are some open questions regarding how to design user interfaces for both personalised voices and expressive speech. My ongoing research suggests that there are many practical and affordable solutions to address both sets of questions. I mention these open questions only because I believe that, until people have direct experience using these features, there will be ongoing confusion regarding these two essential aspects of speech synthesis. In my research, people stop confusing and merging the concepts of personalised voices and expressive voices once they experience them in action.

With that background in place, I would like to outline why expressive synthetic speech is both needed and desired, particularly for children who use AAC systems:

  • Expressive speech is motivating and engaging for children
  • Pragmatically, there are strong social consequences for getting the tone of voice wrong. There are also consequences for a lack of expressive speech.
  • Children learn to understand and imitate tone of voice very early in child development. Linguistic understanding and use come months, if not years, later. Current AAC technology can be too difficult for users who, due to age or cognition, are still developing symbolic communication skills.
  • Adults struggle to model language for children on AAC systems when the devices do not match the expressive way humans actually speak.
  • The voices on AAC devices are a key reason many families and children reject AAC devices during trials or later abandon them when they don’t meet everyday social communication demands.
  • Some children with autism find producing and controlling expressive speech with their biological voices difficult. There are likely therapeutic benefits to using expressive speech synthesis to explore and learn how this core aspect of spoken language ‘works’ in social situations.
  • Young children need to communicate for many purposes (requesting, clarifying, protesting, asking questions, refusing, telling jokes, etc.). To do this well, you need expressive voices. To do this independently without expressive voices, you need to learn how to combine words. This is a skill that takes years for typically developing children to acquire; AAC users often need even more time. With expressive voices, someone primarily communicating at the single-word stage can easily say “daddy” as a request, protest, clarification, comment, question, answer, or command. Without expressive voices, phrases of two or more words are often needed.
  • AAC is slow – having to find additional words to compensate for a lack of tone of voice is linguistically and cognitively taxing. It also takes more time than selecting a single word and the correct tone of voice.
  • The best contexts for language development – play and social interaction – rely heavily on tone of voice. Imagine pretend play, interactive storybook reading, retelling stories, joking, and consoling a peer without tone of voice.

 

In addition to my observations and thoughts as a clinician and researcher, I also wanted to share the perspective of families, AAC users, and professionals regarding the need for expressive voices:

“Verbal communication is so much more than the choice of words; much is also conveyed by the tone of voice, the expressive voice. Although the current synthesized voices are great at saying words, they lack expression. This means that those of us who rely on these voices to communicate are still not able to communicate fully. We are left with monotone, expressionless communication. As a motivational speaker who uses a communication device, this is extremely annoying and frustrating as I have no way to fully express excitement, passion or even a rant. How impactful can a motivational speaker be without tone of voice?”

– Glenda Watson Hyatt, Author, Motivational Speaker and Badass Agitator

 “Good relationships rely on a communication partner. If AAC is important to communication, adding voice enhances the experience for the partner and makes it more likely that they will continue with, expand on and value the interaction. Adding expressive voice is like the icing on the cake. The more enjoyable interaction is - the more likely the messages will be meaningful. And what kids who have communication challenges most need is positive experiences of interaction. Lots and lots of them.”

– Keryn Mells, Autism professional

“I want to be able to have expressive voices so that others don't get confused over what I'm trying to express through my AAC device, because as an autistic individual, my body language doesn't often match up to what I'm feeling on the inside, and as someone who has alexithymia I need to have a voice role modelled to me [to show] what a feeling is supposed to sound like” 

– Gabrielle Hogg, AAC user, Autism advocate over at Autismo Girl

“My children would definitely choose voices with emotion over monotone. It would save them repeating phrases for added emphasis to ensure others are listening.
Being able to convey different emotions would increase the likelihood of my children using AAC.”

– Stephanie R., Parent

“My son who has very limited spoken and signed expressive language loves to communicate and can be very dramatic using inflections in tone, to express himself. He has an AAC device but does not really bother using it much and I am certain if it had a more expressive output voice rather than the monotonous tone he would use it much more.”

– Frian Wadia, Parent

In summary, I strongly believe that we need to add expressive speech synthesis capabilities to AAC devices.  There are some unknowns regarding this, particularly around user interface, but these are research questions that are well on their way to being solved.  

At this stage, what we do know is that not all synthetic speech companies provide voices capable of expressive speech.

When deciding upon speech synthesis vendors, I urge you to consider whether the selected solution allows for both custom personalised voices and expressive voices. Both capabilities are essential and also increasingly desired by end users, funders, and AAC assessors.

 I know that as soon as expressive speech is a reality on a mass-market AAC device, it will be a key consideration in what I recommend during my AAC assessments, particularly for children who are the most disadvantaged by the lack of expressive synthetic speech.

 

Kind regards,

 

Shannon Hennig, PhD CCC-SLP
Researcher and AAC clinician