Tuesday, May 27, 2014

Voice Recognition on the Raspberry Pi - Reality Check

My recent experience with voice control on the Pi got me to thinking.  Why wasn't this a rising star and constantly being talked about?  The idea of talking to your house is so compelling that there must be hundreds of implementations out there.

Well, there are, and none of them work very well.

I described my experience with Jasper; it just didn't live up to the hype.  So I went looking and did some experimenting.  Everyone talks about how good Siri is.  My experience with it has been far less than stellar; all the phones I've tried it on misunderstand me about 6 times out of 10.  Google's implementation seems to work better, and I get about an 80% success rate.  Both of these are stellar compared to several software techniques I tried out, with the absolute worst being CMU Sphinx, which Jasper is based on.

Remember, I'm looking at this as a way to control my house with a little computer, not dictate letters while wearing a headset, so let me talk a bit about methods.  No, I'm not going to bore the heck out of you with a dissertation on the theories of voice recognition; I want what everyone else wants: I want it to work.  There are basically two methods of doing speech recognition right now, local and distributed.  By local I mean totally on one machine; distributed is when the sound is sent over the internet and decoded somewhere else.  Google's voice API is an example of distributed, and CMU Sphinx is an example of local.

What we all want is for it to operate like Star Trek:


Nice clear beep

"Turn on the porch lights"

Nice clear acknowledgement, and maybe a, "Porch light is now on."

I went through the entire process of bringing up CMU Sphinx <link>, and when I tried it, I saw something on the order of, "Burn under the blight."  To be fair, Sphinx can be trained and its accuracy will shoot way up, but that takes considerable effort and time.  The default recognition files just don't cut it.  Especially when I tried the same thing with 100% (yes, totally accurate) results using Google's voice interface.  The problem with Google's interface is that it only works in the Chrome browser.  Yes, there are tools out there that use the Google voice API, notably VoiceCommand by Steve Hickson <link>, but expect them to quit working soon.  Google ended their offering of version 2 of the interface, and version 3 is limited in how many requests can be made and requires a special key to use.  Thus ends a really cool possibility; I hope they bring it back soon.

So, the local possibilities are inaccurate and the distributed ones are accurate, but the one everyone was using is likely to disappear.  There are other distributed solutions; I brought up code taken from Nexiwave <link> and tested it.  There was darn near a 100% success rate.  The problem was delay.  Since I was using a free account, I was shuffled to the bottom of the queue (correctly and expectedly), so the response took maybe three seconds to come back.  Now, three seconds seems like a small price to pay, but try it out with a watch to see how uncomfortable that feels in real use.  It's not that Nexiwave is slow, it's that the doggone internet takes time to send data and get back a response.  I didn't open a paid account to see if it was any better; this was just an experiment.
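If you want to put a number on that delay, wrap the request in a timer and watch the seconds pile up.  A quick sketch (a time.sleep stands in for the actual network round trip, and the helper name is my own):

```python
# Time how long a call takes; time.sleep stands in here for the real
# "send audio, wait for the transcript" network round trip.
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

# Pretend the transcription round trip takes a quarter of a second:
result, elapsed = timed(time.sleep, 0.25)
print("Round trip took %.2f seconds" % elapsed)
```

Three seconds on a stopwatch doesn't sound like much until that number is attached to every single light switch.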

But, think about it a bit.  "Computer," one thousand and one, one thousand and two, one thousand and three, "Yes."  Then the command, "Turn on the porch light," etc.  It would be cool and fun to show off, but do you really want to do it that way?  Plus it would require that the software run continuously to catch the occasional "Computer" command initiation.  Be real, if you're going to have to push a button to start a command sequence, you might as well push a button to do the entire action.  Remember, you have to have a command initiator; otherwise something like, "Hey Jeff, get your hand out of the garbage disposal, it could turn on," could be a disaster.  A button somewhere labeled "Garbage Disposal" would be much simpler and safer.
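The gating logic itself is the easy part; it's the always-on listening that costs you.  A minimal sketch, assuming the recognizer hands back plain-text transcripts (the function name and wake word are just illustrations):

```python
# Wake-word gating: only text that follows the wake word is treated as
# a command; everything else is ignored as ordinary conversation.

WAKE_WORD = "computer"

def extract_command(transcript):
    """Return the command portion of a transcript, or None if the
    wake word didn't start the sentence."""
    words = transcript.lower().split()
    if not words or words[0] != WAKE_WORD:
        return None
    return " ".join(words[1:])

print(extract_command("Computer turn on the porch light"))
# -> turn on the porch light
print(extract_command("Hey Jeff, get your hand out of the garbage disposal"))
# -> None
```

Ten lines of gate, and everything in front of it still has to be listening 24 hours a day.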

Don't talk to me about Dragon NaturallySpeaking from Nuance <link>.  That tool is just unbelievable.  It is capable of taking dictation at full speed with totally amazing accuracy, but it only runs on machines much larger than a Pi, and not at all under Linux.  Even their development version is constructed for Windows server machines.  Microsoft has a good speech recognition system built right into the OS, and under Windows 8 it is incredible, especially since it comes at no additional cost.  But there aren't many Raspberry Pi machines running Windows 8.

Thus, I don't have a solution.  The most compelling one was Nexiwave, but the delays are annoying and I don't think it would work out long term.  Here's the source I used to interface with it:


# Copyright 2012 Nexiwave Canada. All rights reserved.
# Nexiwave Canada PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.

import sys

# You will need the python-requests package. It makes things much easier.
import requests

# Change these:
# Login details:
USERNAME = "user@myemail.com"
PASSWORD = "mypassword"

def transcribe_audio_file(filename):
    """Transcribe an audio file using Nexiwave"""
    url = 'https://api.nexiwave.com/SpeechIndexing/file/storage/' + USERNAME +'/recording/?authData.passwd=' + PASSWORD + '&auto-redirect=true&response=application/json'

    # To receive transcript in plain text, instead of html format, comment this line out (for SMS, for example)
    url = url + '&transcriptFormat=html'

    # Ready to send:
    sys.stderr.write("Send audio for transcript with " + url + "\n")
    r = requests.post(url, files={'mediaFileData': open(filename,'rb')})
    data = r.json()
    transcript = data['text']
    # Perform your magic here:
    print("Transcript for " + filename + " = " + transcript)

if __name__ == '__main__':
    # Change this to your own audio file:
    filename = "/data/audio/test.wav"
    transcribe_audio_file(filename)

I took this directly from their site and posted it here because it is hard to find, and I don't think they care if I advertise for them.  All I did to make it work was sign up for a free account and enter my particulars in the fields up at the top.  It worked on the first try; simple and easy interface.  It would be relatively easy to adapt this to a voice control system on my Pi if I decide to go that way.  Which I may do for control in the dark of my bedroom, where I don't want to search for a remote that may be behind the side table.

The audio file I sent was my usual, "Porch light on," and it decoded it exactly on the first try.  I tried a few others and they all worked equally well.  Which brings up another item: sound on the Raspberry Pi.  Frankly, unless you're dealing with digital files and streams, it sucks.  There isn't enough filtering on the Pi to keep audio hum out of things.  The amplified speakers I was using had a constant low-level hum (regular ol' 60 hertz hum), and it would get into the audio captured from the USB microphone as well.  This could have been reduced by an expensive power supply with very good filtering, or maybe not; I didn't try.
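For what it's worth, a steady 60 hertz hum is the easy kind to fight in software: a narrow notch filter will knock most of it out of the captured audio before recognition ever sees it.  A sketch using numpy and scipy (neither is part of the recipe above, so you'd have to install them; the sample rate is an assumption):

```python
# Knock a steady 60 Hz mains hum out of captured audio with a narrow
# IIR notch filter.  Requires numpy and scipy.
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 16000        # assumed sample rate of the USB microphone capture
HUM_HZ = 60.0     # mains hum (50 Hz in much of the world)

# Design the notch; Q controls how narrow the cut is.
b, a = iirnotch(HUM_HZ, Q=30.0, fs=FS)

def remove_hum(samples):
    """Zero-phase filter a 1-D array of audio samples."""
    return filtfilt(b, a, samples)

# Demo: one second of a 440 Hz tone buried in 60 Hz hum.
t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 440 * t)
hum = 0.5 * np.sin(2 * np.pi * HUM_HZ * t)
cleaned = remove_hum(tone + hum)
```

It won't fix a noisy power supply, but it's a lot cheaper than a better one.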

To add insult to an already injurious process, ALSA (Advanced Linux Sound Architecture) is the single most confusing sound implementation I've ever seen.  It was constructed by sound purists and technology students, so it is filled with special cases, odd syntax, devices that mostly work, etc.  The documentation is full of 'try this'.  What?  I love experimenting, but I sort of like to have documentation that actually has information in it.  PulseAudio is another possibility, but I'll approach that some other time.  Maybe a few weeks after hell freezes over; ALSA was bad enough.  But if you're going to experiment with sound under Linux, you'll have to deal with ALSA at some point.  Especially if you actually want to turn the volume up or down.
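For the record, the one piece of ALSA configuration I'd actually call essential is pointing the default devices at the right hardware.  A minimal ~/.asoundrc sketch, assuming the USB microphone shows up as card 1 and the on-board audio as card 0 (check with arecord -l and aplay -l; your card numbers may differ):

```
# ~/.asoundrc -- route default capture to the USB mic, playback to the
# on-board audio.  Card numbers are assumptions; check arecord -l.
pcm.!default {
    type asym
    capture.pcm "mic"
    playback.pcm "speaker"
}

pcm.mic {
    type plug
    slave { pcm "hw:1,0" }   # USB microphone
}

pcm.speaker {
    type plug
    slave { pcm "hw:0,0" }   # on-board audio out
}
```

After that, amixer (e.g. amixer sset Master 80%) handles the volume, though the control names vary from card to card.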

I think I'm going to do some research on remote control ergonomics.  There's got to be a cool and actually useful way to turn on the porch lights.


  1. You probably saw it, but if not, the following may be worth a quick read.

    How to Upgrade Jasper’s Voice Recognition with AT&T’s Speech-to-Text API

    Or if the above doesn't work:

    I haven't personally tried it out, as I've all but given up on voice recognition. I still find it interesting; my first exploration of the technology started with Hearsay-II back in the '80s. Even the much-touted Dragon on a contemporary desktop is marginal.

    1. No, I hadn't seen it. I have now. This is very, very much like the system Google had going and took offline. I'm sure it works really well, but it has the inherent problem of having to interact over the internet. Plus, it expires every 90 days and has to be renewed. What a pain to remember over a long period of time.

      Then the kicker: they can remove the free service at any time and leave you with nothing, just like Google did, and the service charge of $99 a year is more than I'm willing to pay for 10-15 commands a day.

      But, if I can figure out a way to avoid using Jasper at all and just interface to this service, I might give it a try. I don't really want voice recognition, I want command recognition. Actually, truthfully, I want a remote control that I can't lose in the cushions of the couch, and that is always within reach.

    2. Lol, well at 10 commands per day and $99 for the service, that's only 2.7 cents per command :) (rounding off of course). As you've stated, I want the Star Trek, Majel Barrett voice on my 100% accurate, computer audio command-response system. Based on industry review, we've still got a long way to go. Even Apple's reported/upcoming entry to the HA industry is unlikely to provide what most of us like-minded folks are looking for. Given the lackluster progress of the last 30 years, I expect we might start to get there in 5 to 10 more years.

      In the meantime, I continue to use remotes/tablets that indeed get lost on a nearly daily basis.

  2. The MOVI shield for the Arduino looks awesome, and the way they programmed it makes more sense. You should take a look at it.

    1. Thanks a lot for the pointer, and as you pointed out, this is a really, really compelling effort. However, I'll wait until I see more about the device. Currently, it requires a headset to talk to it, and who wants to wear a headset around the house to talk to their computer?

      After wasting quite a bit of time on voice recognition, I'm not really willing to jump in and be disappointed all over again. When some of the users start showing their projects and I can get a feel for how useful it might be, I'll decide.

      I really want this thing to work though.

  3. I'm interested in doing this as well, though my aim is to start by making an alarm with more interactivity than Android's native voice control. This is going to be my first coding project, and I'm also surprised by how little this seems to be in use.

    You can actually do Google speech-to-text offline with an Android phone: https://9to5google.com/2016/03/11/google-accurate-offline-voice-recognition/
    This includes phones with the same specs as RPi 3. It works for my Coolpad Catalyst (cheap phone).

    Would it be possible to download it from the phone? Right now I'm thinking about simply using my phone as the microphone, much like this guy did for his LED table: https://www.youtube.com/watch?v=gihvvbNIEo8

    Your thoughts on this would be appreciated. I'm interested in getting past this hurdle so I can move on to working on the STT text itself.

    1. Back in 2014 when I looked into this, Google had the very best implementation of voice, but they discontinued it. You could record a bit and send it to Google, and they would send you back the text. Since the Pi wasn't doing the decoding, it went pretty fast. I liked it a lot. I'm (right this second) working with an Amazon Dot, and it is working pretty well and shows real promise as a voice controlled device in my house.

      I looked into using OK Google to control things around the house and it showed a bit of promise, but dropped it when I got tired of carrying the phone around the house. Frankly, if you have a phone in your hand, it's just easier to start an app and push an on-screen button than it is to overcome 80-90 percent recognition. I wound up screaming at the phone which made it much worse.

      The silly little Amazon Dot is sitting right beside me and works darn near every time. I go to bed and tell it 'goodnight', and it turns off the outside lights, the patio light by my room that my dog insists on having on in the evening, and the bedroom lights. Then the sexy female voice says 'Goodnight'.

      Think about it a bit before you really dig into the phone. Maybe try a couple of experiments to see if the phone is the right device for you. I suspect there will be many voice control devices coming out in the next few years. Heck Google and Amazon are in direct competition right now. It could be that your project takes an unexpected turn.

      Meanwhile if you want to follow my extremely long and involved efforts with the Dot, just look at my latest postings. I'm going into greater detail on how to do it than any other project I've seen out there. Unfortunately, it's taking a lot of time to get it all up there and I'll probably be a couple of weeks more in writing it up. That's why I'm doing it in pieces. If I tried to do it all at once, I'd never finish it.

    2. Thanks for the response. I'll check out your posts.
      Like I mentioned, I'm mainly interested in playing with the text itself. I'd like "Turn off the lights", "Shut off the lights", "Lights off" and "Kill the lights" to all do the same thing by searching for a keyword at the top of the hierarchy - in this case, "lights" - and then searching for a keyword under the object "lights" - in this case, either "kill" or "off".

      Do you think this sounds reasonable?

    3. I've read through your Alexa posts. I've got to ask: why are you bowing down to AWS just to use their voice recognition? It's got to be easier to simply create programs that take strings as input. Here's a very quick example: http://imgur.com/a/8k6h2

      Don't you think that would be easier?

    4. First question: That sounds perfectly reasonable. Actually, that's the way most of the implementations that I've looked at work.
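      A quick sketch of that keyword-hierarchy idea in Python (the device, synonyms, and action names here are all just examples):

```python
# Keyword hierarchy: find the device word first, then look for an
# action word anywhere in the rest of the sentence.

COMMANDS = {
    "lights": {
        "on": "LIGHTS_ON",
        "off": "LIGHTS_OFF",
        "kill": "LIGHTS_OFF",   # synonym for "off"
        "shut": "LIGHTS_OFF",
    },
}

def interpret(sentence):
    """Map a free-form sentence to an action name, or None."""
    words = sentence.lower().replace(",", " ").split()
    for device, actions in COMMANDS.items():
        if device in words:
            for word in words:
                if word in actions:
                    return actions[word]
    return None

print(interpret("Kill the lights"))      # -> LIGHTS_OFF
print(interpret("Turn on the lights"))   # -> LIGHTS_ON
```

      It handles every phrase in your list; the hard part, as always, is getting clean text out of the microphone in the first place.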

      Second question: Two reasons. First, I haven't figured out how to get the keywords from Alexa all the way down to my Pi. I did do a little experimentation with using a Lambda function to forward them to me, but it didn't work. They may not let the messages go directly out or something. I'll look at it a little more over time.

      Second, I wanted to turn my experiments and eventual implementation into a lesson for other folk that are pursuing the same kind of thing. If they can get theirs working easier than I did, they always give me really good feedback and suggestions.

      Third (no, I can't count very well), getting a really good microphone or something that can pick words out of a noisy room is incredibly tough. I have the A/C on, the TV, fridge, freezer and a dog all making noise at the same time, and it still hears me from across the room. This thing has a light circle on top of it that points toward the sound direction it hears, and it almost never misses. A few things, like the TV saying "Alec Baldwin," sometimes kick it off, but very seldom. And, remember, I'm only at their mercy for the voice conversion, not anything else. That way I keep my freedom and data, and just use them as a service that I can shut off at any time.

      I really hate the sound implementation on the Pi. I'm sure someone will come along and get it working really well, but these little Dots are an awesome device for this kind of thing, and if they want to give the service to me free for a year or two and pennies after that, I'm willing to let them.

      Also, once you get a two way path for commands going for one device, the second one is about 10 lines of copy and paste to add. If you get a really complex device that has many functions, it could get trickier, but adding another light is a piece of cake. The big thing that is missing right now is being able to 'push' data. I can't do a door sensor that tells me when the door opens over the Dot ... yet.

  4. Nice post, really liked it. Speech-to-text software is an amazing technology; it has really helped people understand different languages.

  5. I wonder if the Home system could boot from the PC to a remote Pi and then be able to shuttle information from Dragon NaturallySpeaking on the PC to the remote Pi 3/4? Is there an app for that yet?

    1. The problem is the PC. You have to have it running all the time to catch words. Then without a wake word implementation of some kind, it would listen to everything.

      I did integrate Amazon Alexa into my house, though. I can actually turn on the light these days with spoken words.