Modifying the Google AIY Voice Kit to synthesize realistic speech.

Background

There is growing interest in language processing and voice feedback in brain-computer interface (BCI) system design. Vocal feedback has traditionally been important for conveying to researchers whether a patient is in pain or whether an action was intended.

The Google AIY Voice Kit is a $50 piece of equipment intended for educational use of artificial intelligence systems. It can tap into the "OK Google" engine, perform speech-to-text, and even respond using a simple, open-source text-to-speech package.

Problem

The issue is that the speech synthesis is terrible. It's unusable for BCI systems where you may be interested in simulating an experimenter. For example, deploying a BCI system into the real world might require artificial voice instructions to guide setup, debugging and a host of other tasks. Take a look at the start of the video below for an example of what the speech synthesis sounds like with the Pico2Wave system:

Solution

Google offers a text-to-speech Cloud API that is much more realistic. Reshaping the AIY kit to tap into it requires a little bit of setup. It may be difficult to imagine how speech-to-text and text-to-speech fit together in the same system, but take a look at an example framework in voice bots at contact centers to get an idea:
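
Stripped to its essentials, the loop on the AIY kit is: listen for a phrase, transcribe it, decide on a response, and speak it back. A rough sketch of that loop (the CloudSpeechClient usage is assumed from the AIY demo script, and google_tts_say is the modified helper shown in the Setup section below):

from aiy.cloudspeech import CloudSpeechClient

def main():
    client = CloudSpeechClient()
    while True:
        # speech-to-text: block until a phrase is heard and transcribed
        text = client.recognize()
        if not text:
            continue
        if 'goodbye' in text.lower():
            break
        # text-to-speech: speak a response back through the kit's speaker
        google_tts_say('You said ' + text)

if __name__ == '__main__':
    main()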

Setup

First, go through the official instructions to set up the API and authorization to access the speech-to-text services from Google Cloud here. You'll need the same setup for the text-to-speech (speech synthesis) portion too.
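
Before modifying any of the kit's code, it's worth checking that the credentials also work for text-to-speech. A minimal sanity check (the credentials path below is only an example; use whatever key file the official instructions had you download):

import os
from google.cloud import texttospeech

# example path only - point this at your own service account key
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS', '/home/pi/cloud_speech.json')

client = texttospeech.TextToSpeechClient()
voices = client.list_voices()
print('Found %d voices, e.g. %s' % (len(voices.voices), voices.voices[0].name))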

The complete setup requires connecting the CloudSpeechClient located in the demo script with the text-to-speech scripts located in tts.py. I've linked my modifications, but the main changes are in the tts.py script and are summarized below:

from google.cloud import texttospeech # import google cloud
import pygame # we use pygame to play the MP3 audio returned from the cloud

...

def google_tts_say(text, lang='en-US', gender='NEUTRAL', type='text'):
    # replaces the provided "say" function

    ...

    # the SSML format allows for more customization in speech (including pauses)
    if type == 'ssml':
        synthesis_input = texttospeech.SynthesisInput(ssml=text)
    else:
        synthesis_input = texttospeech.SynthesisInput(text=text)

    # Build the voice request, select the language code ("en-US") and voice gender (e.g. "NEUTRAL")
    voice = texttospeech.VoiceSelectionParams(
        language_code=lang, ssml_gender=texttospeech.SsmlVoiceGender[gender]
    )

    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        # Write the response to the output file.
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

    pygame.init()
    pygame.mixer.music.load('output.mp3')
    pygame.mixer.music.play()
    print('playing')
    while pygame.mixer.music.get_busy():
        continue

The key aspects and improvements to the AIY default code are:

  1. the use of SSML to customize the voice, including realistic pauses between sentences (see the example after this list),
  2. the use of Cloud-provided genders to further customize voices,
  3. working with MP3s (rather than .wav files), which is more efficient for longer synthesized speech, and
  4. using pygame to deliver the audio back to the speaker efficiently.
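
For example, a call like the one below inserts a natural pause between two spoken instructions (the instruction text is just illustrative):

# SSML input lets you control pacing; google_tts_say is the helper defined above
instructions = """
<speak>
  Please sit still while we check the electrodes.
  <break time="800ms"/>
  Now count the targets you see on the screen.
</speak>
"""
google_tts_say(instructions, lang='en-US', gender='FEMALE', type='ssml')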

Result

All of this yields some great results for in-lab BCI research or remote setups. Below I perform some tests, from 3 ft. away from the AIY Voice Kit, with some noise in the background.

Some classes of actions (e.g. "turn on the light") are recognized and acted on by the device in under 1 second. Looping in text-to-speech adds about 3 seconds for synthesis processing and delivery. You can also program custom responses - below I test out some lines from The Godfather.
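
One simple way to wire up custom responses on top of the recognition loop sketched earlier (the phrase/reply pairs here are purely illustrative):

# map recognized phrases to spoken replies
RESPONSES = {
    'turn on the light': 'Turning the light on now.',
    'make me an offer': "I'm gonna make him an offer he can't refuse.",
}

def respond(text):
    for phrase, reply in RESPONSES.items():
        if phrase in text.lower():
            google_tts_say(reply)
            return True
    return False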

You might have noticed in the test above that the system didn't recognize the phrase "I am box bot." Speech-to-text can actually take in custom vocabulary to make recognition of domain-specific language more accurate.
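
The AIY demo script exposes this as "hint phrases" passed to the recognizer; the exact signature below is assumed from that demo and may differ across library versions:

from aiy.cloudspeech import CloudSpeechClient

client = CloudSpeechClient()
# bias recognition toward domain-specific terms
hints = ['box bot', 'electrode', 'stimulus', 'calibration']
text = client.recognize(language_code='en-US', hint_phrases=hints)
print('Heard:', text)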

Specifically, for BCI systems, we may be interested in whether participant responses (for example, the number of targets seen) are correct. Below I test how easily we can detect whether a participant says a sequence of numbers (1, 2, 3) correctly.
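
A rough way to score that check (purely illustrative; real transcripts may need more normalization, e.g. the recognizer returning "123" as a single token):

import re

EXPECTED = ['1', '2', '3']
WORDS_TO_DIGITS = {'one': '1', 'two': '2', 'three': '3'}

def sequence_is_correct(transcript):
    # normalize spoken number words and digits, then compare against the target sequence
    tokens = re.findall(r'\w+', transcript.lower())
    digits = [WORDS_TO_DIGITS.get(t, t) for t in tokens]
    digits = [d for d in digits if d.isdigit()]
    return digits == EXPECTED

print(sequence_is_correct('one two three'))   # True
print(sequence_is_correct('one three two'))   # False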

Summary

Natural language systems are advancing at an absurd pace right now, and playing around with the AIY Voice Kit makes it easy to see the potential for research studies.

Voice synthesis through Cloud-deployed models provides a useful way to deliver feedback and respond to participants in real time.