Custom Voice Components

Grzegorz Tluszcz

July 5, 2019

When you start working on a voice project, you begin with a simple Welcome Intent. Everything goes well and your app grows fast, and so does the number of intents and supported features. Suddenly you realize that introducing any change has become hard, and you find yourself lost in the code. You need a more sophisticated architecture to support all of the business requirements.

This article proposes an architecture that we introduced in one of our apps to solve these issues.

Requirements of the architecture

Let's think for a moment about the characteristics of the desired architecture:

It makes navigating the code easy
It allows us to reuse already implemented functionality
It makes introducing changes effortless
It speeds up development of the app

And some more technical requirements:

We want to allow some intents (and their behavior) to be invoked only under specific conditions.
It has to support utilities (easy access to the last speech, a debug mode would be nice, a no-match fallback per intent).
It needs to allow us to dynamically change the flow of the conversation based on the user input.

Main idea

Our idea embraces modularity: behaviors are chained together to form a conversation. A dialogue is often depicted with speech bubbles (also called balloons), the familiar cloud-like graphics with text inside, each representing a single message. What the user usually does not see is how many actions have to happen to create such a speech bubble. Building a message sometimes involves a call to an API or a database, which happens in the background without the user's knowledge. We can think of all these actions as separate units that we chain together to create a speech bubble.

What is a component?

Each chain link defines and encapsulates some behavior of the app, for example greeting the user or retrieving data. From now on I will call a single chain link a component.

You may end up with components responsible only for adding text to the conversation, only for side logic (not seen by the user), or both. The important thing to remember is that each component should be designed as a reusable whole, so that it can be used in many chains of components.

What is a chain of components?

At this point, you probably sense that a chain of components is analogous to a speech bubble. Here is one more example to make that concrete.

Imagine one component that greets the user and another that saves the username from the input to the database. Having a separate greeting component makes sense because this action may appear in different scenarios of our app. However, there is one scenario in which the user visits our app for the first time. In that case, we would like to greet the user and save their name in one step (one speech bubble). For that to happen, we need to chain these components together.

When components are chained together they create a chain of components.

Characteristics of the chain of components:

The first component of a chain is invoked by intent.
Each next component in a chain is invoked by a redirect from the previous component.
The last component in a chain either closes the app or waits for a user answer.
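
The characteristics above can be sketched without any voice framework. This is only a minimal model under assumptions, not the implementation used in our app: `greet`, `saveName`, `runChain`, and the shape of the `conversation` object are all hypothetical names chosen for illustration.

```javascript
// Framework-free sketch; every name here is hypothetical.
// A component is a function that may add speech, may run side logic,
// and returns the next component (a redirect) or null to end the chain.

const saveName = (conversation) => {
  // Side logic only: nothing is added to the speech bubble.
  conversation.storage.username = conversation.input.name;
  return null; // last link: the app now waits for the user's answer
};

const greet = (conversation) => {
  conversation.speech.push(`Hello, ${conversation.input.name}!`);
  return saveName; // redirect to the next link in the chain
};

// The first component is invoked by an intent; the rest follow by redirect.
function runChain(first, conversation) {
  let component = first;
  while (component !== null) {
    component = component(conversation);
  }
  return conversation.speech.join(' '); // everything ends up in one speech bubble
}

const conversation = { input: { name: 'Ann' }, speech: [], storage: {} };
const bubble = runChain(greet, conversation);
console.log(bubble);
console.log(conversation.storage.username);
```

Here the redirect from `greet` to `saveName` is unconditional to keep the sketch short; a real component could decide at run time whether to redirect at all.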

Components in action

Before we get to the business scenario that will help me showcase the architecture, I would like to clarify some terms that I will be using in the next few sections of this article.

conversation - a dialogue
state - a specific step in a conversation
utterance - the words used by the user
intent - what the user wants, as recognized by NLP based on the utterance
behavior/action/functionality - something that our app or one of its components does
callback - a function that defines the behavior of a component (or of the app)

Scenario

Imagine a situation in which a client reaches out to your company and asks you to create a storytelling game. However, during the call, the client feels unsure about the capabilities of voice technology and its potential for storytelling. Your team takes a step forward and proposes to create a simple demo to resolve these doubts.

The demo will:

tell the main story
optionally tell the prolog, just to make the demo more interactive
say goodbye when the user signs off

At the start of the application, we want to give the user a choice: either hear the main story straight away or hear the prolog first. This means that asking for the main story, as well as for the prolog, is only available right after the launch of the app.

In addition, it feels natural that right after telling the prolog we immediately proceed to the main story. That way, the user will be able to hear either the short or the long version of the tale. After telling the story, we want to wait for the user to say goodbye and then answer politely.

We don't want the user to be able to ask for the story or the prolog after hearing it (to avoid confusion). The only available intent then results in saying goodbye and ending the conversation. You may want to read this twice. It feels complicated, but in a moment everything should become much easier.

What I have described above can, in fact, be represented as a state machine: certain intents are available in certain states. We can picture this state machine as a graph in which vertices are states and outgoing edges are callbacks. Utterances and intents are not displayed in the graph.


A distinguishable state is nothing but a component.

A component is defined by:

a state name (can be derived automatically from the callback)
a callback (only one)
the next possible components (can be an empty set)

State name - thanks to the state name we are able to whitelist the next components, so it plays a huge role in the implementation. The idea is that when an intent is about to be handled, the app checks which state it is in and which components it can invoke.
Callback - defines the behavior of the component and may include a redirect to the next component.
Next possible components - we whitelist the components that can be invoked next to ensure that our components are invoked only when we wish. It also helps with navigation and with understanding the order of components in a chain of components. Next components can be invoked directly by an intent or by a redirect from a previous component.
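
To make the three parts of a component concrete, here is a framework-free sketch of the demo scenario. It is a simplified model under assumptions rather than our production code: the `components` table, the `start` state, the `AskForProlog` and `Goodbye` names, and the `handle` function are hypothetical. State names are derived as `after<ComponentName>`, and a callback signals a redirect by returning the name of the next component.

```javascript
// Hypothetical model: each component has one callback and a whitelist (`next`).
const components = {
  Launch: {
    callback: (speech) => { speech.push('Story or prolog first?'); },
    next: ['AskForStory', 'AskForProlog'],
  },
  AskForProlog: {
    // Returning a name means "redirect to that component in the same turn".
    callback: (speech) => { speech.push('PROLOG.'); return 'AskForStory'; },
    next: ['AskForStory'],
  },
  AskForStory: {
    callback: (speech) => { speech.push('MAIN STORY.'); },
    next: ['Goodbye'],
  },
  Goodbye: {
    callback: (speech) => { speech.push('Bye!'); },
    next: [],
  },
};

// The state name is derived from the component name.
const stateOf = (name) => `after${name}`;

// Whitelist per state: state name -> components that may be invoked there.
const routes = { start: ['Launch'] };
for (const [name, component] of Object.entries(components)) {
  routes[stateOf(name)] = component.next;
}

function handle(state, intent) {
  // Check which state we are in and whether the intent is whitelisted there.
  if (!(routes[state] || []).includes(intent)) {
    return { state, speech: 'Sorry, that is not available right now.' };
  }
  const speech = [];
  let name = intent;
  while (name) {
    state = stateOf(name);
    const redirect = components[name].callback(speech);
    if (redirect && !routes[state].includes(redirect)) {
      throw new Error(`${redirect} is not whitelisted in ${state}`);
    }
    name = redirect;
  }
  return { state, speech: speech.join(' ') };
}

let turn = handle('start', 'Launch');
turn = handle(turn.state, 'AskForProlog'); // prolog and story in one bubble
const rejected = handle(turn.state, 'AskForProlog'); // no longer whitelisted
console.log(turn, rejected);
```

After the prolog chain finishes, the conversation is in the state that follows the story, so only the goodbye component can be invoked next.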

The benefits

But why on earth do we need that? There is one more thing worth noticing when you look at the graph. Right now we have 3 chains of components, which together consist of 3 components.


You can easily imagine a more complex example with a huge network of components that create complicated paths that split and merge many times. And that's where the concept of components really shines.

Thanks to the modularity, we:

write a given piece of functionality only once and extract it into a separate component, so it can be used in many chains of components while the logic is kept in one place
can easily manipulate the flow of the conversation just by redirecting to other components under some condition
can be sure that certain components can be invoked only in a specific state
can easily change the order of behaviors, simply by swapping components in the chain of components
have a good understanding of the order in which components are invoked
keep our code clean
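
Redirecting under a condition can be illustrated with a small sketch. To keep it runnable without the framework, it uses a stub in place of the Jovo object; `createStubJovo`, the `heardProlog` flag, the `AskForProlog` intent, and the `$user.$data` shape of the stub are assumptions made for this example only.

```javascript
// Hypothetical stub standing in for the Jovo object.
// It only records what the callback does.
function createStubJovo(userData) {
  const sentences = [];
  return {
    $user: { $data: userData },
    $speech: { addSentence: (text) => sentences.push(text) },
    toIntent: (intent) => ({ redirectedTo: intent, speech: sentences.join(' ') }),
  };
}

// The callback decides at run time which component comes next in the chain.
const launchCallback = (jovo) => {
  jovo.$speech.addSentence('Welcome!');
  const next = jovo.$user.$data.heardProlog ? 'AskForStory' : 'AskForProlog';
  return jovo.toIntent(next);
};

const newUser = launchCallback(createStubJovo({ heardProlog: false }));
const returningUser = launchCallback(createStubJovo({ heardProlog: true }));
console.log(newUser.redirectedTo, returningUser.redirectedTo);
```

The components being redirected to do not change at all; only the callback that picks between them carries the condition.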

Another advantage of this approach is that it works at a level of abstraction above any single platform, so it can be easily applied to both Google Assistant and Amazon Alexa.

Implementation using Jovo

Enough talking! I will show you now how all this theory can be applied in practice by creating a very simple voice app.

First, let's implement our component! Its state name will be afterLaunch, the callback will tell the user 'Hello! What's up?', and there will be no following components.

// callbacks/launch.js

const followUpState = `afterLaunch`

const callback = function () {
  this.followUpState(followUpState) // We set the state
  this.tell('Hello! What\'s up?')
}

const routes = {
  [followUpState]: {} // There are no following components
}

module.exports = {
  intentCallback: callback, // export the callback defined above
  routes: routes
}

Second, register all "next possible components" by inserting routes. What is unusual here is that we have to register the callback manually, since it is the first callback after launching the app.

// app.js

const launch = require('./callbacks/launch')

// `app` is the Jovo app instance created earlier in this file
app.setHandler({
  'LAUNCH': launch.intentCallback, // after detecting LAUNCH intent fire callback from our component
  ...launch.routes, // insert routes from our components (optional since they are empty)
})

At this moment we should have a simple app that launches and says 'Hello! What's up?'

Next, allow the user to ask for a story. You will need to define an intent named AskForStory in your language model, then create a component.

// callbacks/askForStory.js

const followUpState = `afterAskForStory`

const callback = function () {
  this.followUpState(followUpState)
  this.tell('HERE WE TELL THE STORY')
}

const routes = {
  [followUpState]: {} // There are no following components
}

module.exports = {
  intentCallback: callback, // export the callback defined above
  routes: routes
}

// callbacks/launch.js
// (followUpState and module.exports stay as before)

const askForStory = require('./askForStory')

const callback = function () {
  this.followUpState(followUpState)
  this.ask('Hello! What\'s up? What do you want to do here?') // We change 'tell' to 'ask' not to quit the app after greeting
}

const routes = {
  [followUpState]: {
    AskForStory: askForStory.intentCallback // whitelist new component
  }
}

Add it to the routes.

// app.js

const launch = require('./callbacks/launch')
const askForStory = require('./callbacks/askForStory') // same folder as launch.js

app.setHandler({
  'LAUNCH': launch.intentCallback,
  ...launch.routes,
  ...askForStory.routes // (optional since they are empty)
})

Now we greet the user and ask what they want to do. If they trigger the AskForStory intent, we tell them the story.

Let's connect these components into one chain, so that we greet the user and tell the story in one speech bubble.

// callbacks/launch.js

const callback = function () {
  this.followUpState(followUpState)
  this.$speech.addSentence('Hello! What\'s up? What do you want to do here?') // we need to use jovo speech builder
  return this.toIntent('AskForStory') // here is how we perform redirect to already whitelisted component
}

// callbacks/askForStory.js

const callback = function () {
  this.followUpState(followUpState)
  this.$speech.addSentence('HERE WE TELL THE STORY') // speech builder again
  this.ask(this.$speech) // we ask because we don't want to close the app
}

After launching the app you should now hear the greeting and the story at once. Try asking for the story again. It won't be available, since it's not whitelisted in the current state.

Upgrade

At this point, you should have a pretty good understanding of how this works and how you can add more components on top of previous ones. You have probably also noticed that the components share a similar structure. How about we go one step further and create a higher-order function that does half of the job for us?

Create a helper function.

// ./helpers/createComponent.js

module.exports = function({
  callback_name,
  routes = {},
  callback,
}) {
  // The state name is derived from the callback name
  const followUpState = `after${callback_name}`

  const intentCallback = function () {
    this.followUpState(followUpState)
    return callback(this)
  }

  const callbackRoutes = {
    [followUpState]: {
      ...routes,
    },
  }

  return {
    intentCallback: intentCallback,
    routes: callbackRoutes
  }
}
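
One way to check what the helper produces, without running a voice platform, is to call the returned callback against a stub. The sketch below inlines a copy of the helper so that it is self-contained; `stubJovo` and the `calls` log are hypothetical and only mimic the two methods the component uses.

```javascript
// Copy of the helper, inlined so the sketch runs on its own.
function createComponent({ callback_name, routes = {}, callback }) {
  const followUpState = `after${callback_name}`;
  const intentCallback = function () {
    this.followUpState(followUpState);
    return callback(this);
  };
  return { intentCallback, routes: { [followUpState]: { ...routes } } };
}

// Hypothetical stub that records calls instead of talking to a voice platform.
const calls = [];
const stubJovo = {
  followUpState: (state) => calls.push(['followUpState', state]),
  ask: (text) => calls.push(['ask', text]),
};

const askForStory = createComponent({
  callback_name: 'AskForStory',
  callback: (jovo) => jovo.ask('HERE WE TELL THE STORY'),
});

// The framework normally supplies `this`; here we bind the stub by hand.
askForStory.intentCallback.call(stubJovo);
console.log(calls);
console.log(askForStory.routes);
```

The log shows that the state is set before the callback runs, and that a component created without routes still registers its (empty) state.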

Use the helper in our code.

// callbacks/launch.js

const askForStory = require('./askForStory')
// path assumes helpers/ sits next to callbacks/
const createComponent = require('../helpers/createComponent')

module.exports = createComponent({
  callback_name: 'Launch',
  routes: {
    AskForStory: askForStory.intentCallback
  },
  callback: jovo => {
    jovo.$speech.addSentence('Hello! What\'s up? What do you want to do here?')
    return jovo.toIntent('AskForStory')
  }
})

// callbacks/askForStory.js

const createComponent = require('../helpers/createComponent')

module.exports = createComponent({
  callback_name: 'AskForStory',
  callback: jovo => {
    jovo.$speech.addSentence('HERE WE TELL THE STORY')
    jovo.ask(jovo.$speech)
  }
})

Wow! This makes things a lot easier.
A couple of lines of code and we have our new component in place. Thanks to this higher-order function we can add more logic to every component at once (such as a debug mode, a default no-match fallback, and a custom repeat callback). All of our components will get new powers almost without any configuration.
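
As a rough illustration of adding logic to every component at once, the sketch below extends the helper with a debug log and a default no-match fallback. It is an assumption-laden example, not our actual helper: the `debug` and `noMatchText` options and the stub are hypothetical, and `Unhandled` is assumed to be the intent name the framework routes to when nothing else matches in a state.

```javascript
// Hypothetical extension of the helper: debug log plus default fallback.
function createComponent({
  callback_name,
  routes = {},
  callback,
  debug = false,
  noMatchText = 'Sorry, I did not get that.',
}) {
  const followUpState = `after${callback_name}`;

  const intentCallback = function () {
    if (debug) console.log(`[component] ${callback_name}`);
    this.followUpState(followUpState);
    return callback(this);
  };

  const callbackRoutes = {
    [followUpState]: {
      // Default fallback (assumed intent name); listed first so that a
      // component can override it through its own routes.
      Unhandled() {
        return this.ask(noMatchText);
      },
      ...routes,
    },
  };

  return { intentCallback, routes: callbackRoutes };
}

// Stub standing in for the Jovo object, for demonstration only.
const spoken = [];
const stubJovo = { followUpState: () => {}, ask: (text) => spoken.push(text) };

const story = createComponent({
  callback_name: 'AskForStory',
  callback: (jovo) => jovo.ask('HERE WE TELL THE STORY'),
});

story.intentCallback.call(stubJovo);
story.routes.afterAskForStory.Unhandled.call(stubJovo);
console.log(spoken);
```

Because the fallback lives in the helper, existing components pick it up without any change to their own definitions.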

Summary

In my mind, the main goals of the architecture have been accomplished. We have flexible and reusable components that can be easily extended and are fast to implement. Each functionality has its own place. Code feels tidy and easy to tinker with.

The proposed architecture worked well in our case: we managed to build a full storytelling game using this technique. We already have ideas on how to improve on it. More about that coming soon.
