
Lab Notes: Building Our Own Voice Assistant With Jasper

Developing a smart restaurant tool using the open source personal assistant

Speech recognition is a technology we have been interested in for a while. Some of our earlier projects include integrating Alexa with Slack and creating an Alexa skill to aid with inventory management. We have been eager to work with speech recognition again since attending this year’s SpeechTek conference in Washington, DC. Excited to apply some of the lessons we learned and develop something both interesting and practical, we set out to create a smart digital assistant with Jasper.

Getting Started

Jasper is an open source, modular platform which combines several component systems to create a powerful digital assistant. The platform is able to work with a variety of existing speech-to-text (STT) and text-to-speech (TTS) engines, some of which use an internet connection to offer more robust performance, while others include simpler functionality in order to eliminate the need for internet connectivity. Jasper supports plugins written in Python, each containing the specific set of words that should trigger that plugin. In a complete Jasper process, a microphone and the chosen STT engine are used to listen for a wake word and the commands for the plugin the user wishes to activate.

Upon hearing recognized commands, Jasper will execute the plugin matching the given commands. The chosen TTS engine is used throughout the process to synthesize speech feedback to the user, and then play it through the speakers, when needed. Once the plugin has finished executing, Jasper then returns to a standby state, continuously using the microphone to listen for the wake word but otherwise remaining idle.
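
The flow described above can be sketched in plain Python. This is an illustrative simulation only, not Jasper's code: the Plugin class, the run_assistant function, the trigger word, and the canned reply (including its wait time figure) are all hypothetical, and text strings stand in for what the STT engine would transcribe from the microphone.

```python
# Illustrative simulation of the wake-word / command flow; NOT Jasper's actual code.
from dataclasses import dataclass
from typing import Callable, Iterable, List

WAKE_WORDS = "hello jasper"  # wake phrase used in this project


@dataclass
class Plugin:
    """Hypothetical stand-in for a plugin: trigger words plus a handler."""
    words: List[str]
    handler: Callable[[str], str]

    def matches(self, text: str) -> bool:
        # A plugin is triggered when any of its words appears in the transcript.
        return any(word in text.lower() for word in self.words)


def run_assistant(transcripts: Iterable[str], plugins: List[Plugin]) -> List[str]:
    """Consume transcribed utterances and return what would be spoken aloud."""
    spoken = []
    awake = False  # standby: only the wake phrase is acted on
    for text in transcripts:
        if not awake:
            awake = WAKE_WORDS in text.lower()
            continue
        # Awake: treat this utterance as a command, then return to standby.
        plugin = next((p for p in plugins if p.matches(text)), None)
        spoken.append(plugin.handler(text) if plugin else "Sorry, I did not catch that.")
        awake = False
    return spoken


# Placeholder plugin and reply, assumed purely for illustration.
plugins = [Plugin(["wait"], lambda text: "The average wait is 12 minutes.")]
heard = ["some background chatter", "Hello Jasper", "what is the wait time"]
print(run_assistant(heard, plugins))
```

In this sketch, anything said before the wake phrase is ignored, and the assistant drops back to standby after handling one command, mirroring the idle-until-woken behavior described above.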

On the hardware side, we used a Raspberry Pi 3 Model B, Kinobo Akiro USB microphone, and some USB speakers. No more hardware is necessary for Jasper to operate once it’s running, although further hardware may be helpful during development.

Design

For this project, we were motivated to create a natural-feeling, speech-based assistant that might be useful in a restaurant setting, and we chose the wake words “Hello Jasper.” Following this theme, we created plugins that allow Jasper to connect to a server; retrieve data for average wait time, queue count, critical alert count, or a summary of overall status; and accept commands to assign a table for cleanup or an order for preparation. We also wanted the product to be able to run without having to transmit data for speech recognition or synthesis.

Employing some of the concepts we learned at SpeechTek, we also designed Jasper to have a more natural speech interface. We included several different command phrasings for each of the module trigger phrases and for each of Jasper’s responses to commands, and had Jasper deliver requested data in a more human way.

For instance, if the user asks, “Are there any critical alerts?”, Jasper does not answer with the kind of phrasing a computer or robot might typically use, e.g. “Critical alert count: 5.” Instead, we set it to respond in a more natural way, e.g. “I see 5 critical alerts at the moment.”
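
One simple way to get this kind of variation is to pick a response template at random and fill in the data. The sketch below is an assumption about how that might look, not the code from our plugins: the describe_alerts function and the second and third templates are hypothetical, and only the first phrasing is taken from the example above.

```python
import random
from typing import Optional

# Response templates; only the first phrasing comes from the post,
# the others are hypothetical variations added for illustration.
ALERT_RESPONSES = [
    "I see {count} critical {noun} at the moment.",
    "There {verb} {count} critical {noun} right now.",
    "Right now we have {count} critical {noun}.",
]


def describe_alerts(count: int, rng: Optional[random.Random] = None) -> str:
    """Return a randomly phrased, grammatically agreeing alert summary."""
    chooser = rng if rng is not None else random
    noun = "alert" if count == 1 else "alerts"
    verb = "is" if count == 1 else "are"
    # Unused keyword arguments are ignored by str.format, so every
    # template can be filled from the same set of values.
    return chooser.choice(ALERT_RESPONSES).format(count=count, noun=noun, verb=verb)


print(describe_alerts(5))
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which is handy when testing the phrasing.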

An additional step we took to realize our goal of a natural speech interface was changing TTS engines; as we neared the end of development, we decided to switch to a TTS engine that we felt offered slightly more realistic speech.

Development

Our first step was to set up Jasper. We initially installed the software following the official documentation guide, and sought out additional information from other online resources such as user forums. In the course of this research, we learned that a newer version of Jasper is available on a developmental branch of their GitHub repository. We then shifted gears to finding information on installing this newer version of Jasper, as documentation for it does not exist on the Jasper website. After finishing this installation, we saw that the software design of Jasper had changed a considerable amount, and did some analysis of the sample modules that were included with Jasper to determine how custom plugins need to be written for the newer version of Jasper.

After Jasper was successfully set up, we moved on to setting up our STT plugin. For this part of the system, we chose the Pocketsphinx version of CMU Sphinx. CMU Sphinx is a robust, open source speech recognition engine. Pocketsphinx is a lighter version optimized for use in handheld and mobile systems. This installation also entailed some research, as there is a newer version that is not covered by the Jasper documentation. Pocketsphinx also requires several libraries to function, which adds up to a lengthy and sizable installation step. Thankfully, other resources exist online which made it possible to determine the proper version and installation procedure for Pocketsphinx.

The TTS plugins we worked with were a relative breeze to set up. We initially chose the Flite version of Festival TTS before switching to SVOX Pico, but in both cases installation and configuration were quick and simple compared to Jasper and Pocketsphinx.

Difficulties

Although the final product we were able to achieve with Jasper was impressive, the process involved some challenges. Most of the software we used for the project required sizable downloads and rather lengthy compilation steps. Whether that’s a problem or merely something to factor into planning depends on the particular goals and requirements of the developer. Our requirements were that Jasper run offline and on a Raspberry Pi. From these requirements, we decided to use the Pocketsphinx version of CMU Sphinx for STT and SVOX Pico for TTS, and these choices came with some setbacks.

Jasper documentation is unfortunately quite lacking, and it took a considerable amount of searching through user posts and discussions to piece together a successful installation procedure and handle bugs. Furthermore, as of the development of this project the official Jasper site was lacking updates to cover the latest, developmental branch of Jasper. This development version of Jasper is actually easier to work with, but requires additional work to find user documentation detailing how to make use of it with its changes.

Pocketsphinx, as implemented by Jasper, works quite well but requires some special planning. We found that Jasper worked best when configured with command phrases that were not too short and were clearly distinct from one another. Our custom modules themselves worked just as expected; the issue was rather that a spoken command would sometimes be misrecognized as a different known command, and the module for that other command would then execute.
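
A rough way to spot risky phrases before testing them by voice is to compare their spellings. The sketch below uses Python's standard difflib module for this. It is only a heuristic we are suggesting here, not something Jasper or Pocketsphinx provides: the confusable_pairs function, the 0.7 threshold, and the example phrases are assumptions, and textual similarity is at best a loose proxy for how similar two phrases actually sound.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def confusable_pairs(phrases: List[str], threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    """Flag pairs of command phrases whose spellings are suspiciously similar.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    flagged = []
    for a, b in combinations(phrases, 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged


# Hypothetical command phrases, for illustration only.
commands = ["clean table five", "clean table nine", "what is the average wait time"]
for a, b, score in confusable_pairs(commands):
    print(f"Consider rewording: '{a}' vs '{b}' (similarity {score})")
```

Any pair this flags is worth saying aloud a few times against the real recognizer, since that is the only test that reflects actual acoustic confusion.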

There is also a quirk of Pocketsphinx’s continuous-listening mode, one that is likely to affect any always-on system. After a potential command is heard, and while it is being processed to determine whether it is a recognized command, further speech input is not possible until processing has completed. What this means in practice is that if Jasper erroneously thinks it has heard a command, whether it did hear speech or merely noise, some time will pass before the user is able to actually give a command. This is not necessarily a problem, but it may interrupt the desired flow of operation, and the experience of natural speech, if the user’s command is not recognized on the first try or if background noise triggers the command processing, as the user will have to wait and then repeat the command.

For our TTS engine, we encountered an unexpected bug while switching from Flite to SVOX Pico. Synthesized speech began to sound distorted, although it had initially been clear. From some cursory research, this likely resulted from a bitrate issue and would have required some changes to the system configuration, but as it did not impact the effectiveness of the system, we did not prioritize fixing it.

Closing Thoughts

In spite of the obstacles we encountered, we found Jasper’s ability to smoothly compartmentalize and synthesize the different aspects of a digital assistant system quite impressive. The custom modules and supported speech engines do plug in or swap out just like it says on the box, so to speak. More up-to-date documentation would make the development process simpler, but in its present state Jasper is fleshed out enough to be able to create a working product. Working within our restrictions of an offline system, Jasper, Sphinx, Festival, and SVOX were able to accomplish the goal we envisioned.