Desert Home: Voice Recognition on the Raspberry Pi

Tuesday, May 27, 2014

Voice Recognition on the Raspberry Pi - Reality Check

My recent experience with voice control on the Pi got me to thinking. Why wasn't this a rising star and constantly being talked about? The idea of talking to your house is so compelling that there must be hundreds of implementations out there.

Well, there are, and none of them work very well.

I described my experience with Jasper; it just didn't live up to the hype. So I went looking and did some experimenting. Everyone talks about how good Siri is. My experience with it is far less than stellar; every phone I've tried it on misunderstands me about 6 times out of 10. Google's implementation seems to work better, and I get about an 80% success rate. Both of these are stellar compared to the several software packages I tried out, with the absolute worst being CMU Sphinx, which Jasper was based on.

Remember, I'm looking at this as a way to control my house with a little computer, not dictate letters wearing a headset, so let me talk a bit about methods. No, I'm not going to bore the heck out of you with a dissertation on the theories of voice recognition; I want what everyone else wants: I want it to work. There are basically two methods of doing speech recognition right now, local and distributed. By local I mean decoding totally on one machine; distributed means sending the sound over the internet and decoding it somewhere else. Google's voice API is an example of distributed and CMU Sphinx is an example of local.

What we all want is for it to operate like Star Trek:

"Computer."

Nice clear beep

"Turn on the porch lights"

Nice clear acknowledgement, and maybe a, "Porch light is now on."

I went through the entire process of bringing up CMU Sphinx <link>, and when I tried it, I saw something on the order of, "Burn under the blight." To be fair, Sphinx can be trained and its accuracy will shoot way up, but that takes considerable effort and time. The default recognition files just don't cut it, especially when I tried the same thing with Google's voice interface and got 100% (yes, totally accurate) results. The problem with Google's interface is that it only works in the Chrome browser. Yes, there are tools out there that use the Google voice API, notably VoiceCommand by Steve Hickson <link>, but expect them to quit working soon. Google ended their offering of version 2 of the interface, and version 3 limits how many requests you can make and requires a special key. Thus will end a really cool possibility; I hope they bring it back soon.

So, the local possibilities are inaccurate and the distributed are accurate, but the one everyone was using is likely to disappear. There are other distributed solutions; I brought up code taken from Nexiwave <link> and tested it. There was darn near a 100% success rate. The problem was delay. Since I was using a free account, I was shuffled to the bottom of the queue (correctly and expectedly), so the response took maybe three seconds to come back. Now, three seconds seems like a small price to pay, but try it out with a watch to see how uncomfortable that feels in real use. It's not that Nexiwave is slow; it's that the doggone internet takes time to send data and get back a response. I didn't open a paid account to see if it was any better; this was just an experiment.
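If you'd rather put a number on that delay than count along with a watch, you can time the whole round trip. This is only a minimal sketch in Python 3: `recognize` stands for whatever function sends the audio off and returns the text, and `fake_recognizer` is a made-up stand-in that just sleeps to imitate a slow service. Neither name comes from any real API.

```python
import time


def timed_call(recognize, *args):
    """Call recognize(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = recognize(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_recognizer(filename, delay=0.2):
    """Hypothetical stand-in for a network recognizer: sleeps, then answers."""
    time.sleep(delay)
    return "porch light on"


if __name__ == "__main__":
    # Swap fake_recognizer for the real call to measure an actual service.
    text, seconds = timed_call(fake_recognizer, "/data/audio/test.wav")
    print("Got %r after %.2f seconds" % (text, seconds))
```

Run it a handful of times and look at the spread; a single measurement over the internet doesn't tell you much.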

But, think about it a bit. "Computer," one thousand and one, one thousand and two, one thousand and three, "Yes". Then the command, "Turn on the porch light", etc. It would be cool and fun to show off, but do you really want to do it that way? Plus it would require that the software run continuously to catch the occasional, "Computer" command initiation. Be real, if you're going to have to push a button to start a command sequence, you might as well push a button to do the entire action. Remember, you have to have a command initiator or something like, "Hey Jeff, get your hand out of the garbage disposal, it could turn on," could be a disaster. A button somewhere labeled, "Garbage Disposal," would be much simpler and safer.
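The command initiator idea can be sketched in a few lines. This is an illustration under my own assumptions, not anybody's shipping code: the class name `CommandGate`, the wake word "computer", and the five second window are all made up for the example. It takes phrases that have already been recognized as text and passes one through only if it arrives shortly after the wake word.

```python
class CommandGate:
    """Ignore everything until the wake word is heard, then accept one command."""

    def __init__(self, wake_word="computer", timeout=5.0):
        self.wake_word = wake_word
        self.timeout = timeout      # seconds the gate stays open (assumed value)
        self.armed_at = None        # time the wake word was last heard

    def hear(self, phrase, now):
        """Feed one recognized phrase; return a command string or None."""
        words = phrase.lower().strip(" .,!?")
        if words == self.wake_word:
            self.armed_at = now
            return None
        armed = self.armed_at is not None and now - self.armed_at <= self.timeout
        self.armed_at = None        # one command per wake word
        return words if armed else None
```

In real use `now` would come from something like `time.monotonic()`. Feed it the garbage disposal sentence without saying "Computer" first and `hear` returns `None`, which is the whole point. Note that it does nothing about the delay; the recognizer still has to hear and decode every phrase before the gate sees it.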

Don't talk to me about Dragon NaturallySpeaking from Nuance <link>. That tool is just unbelievable. It is capable of taking dictation at full speed with totally amazing accuracy, but it only runs on machines much larger than a Pi, and not at all under Linux. Even their development version is constructed for Windows server machines. Microsoft has a good speech recognition system built right into the OS, and under Windows 8 it is incredible, especially at no additional cost at all. But there aren't many Raspberry Pi machines running Windows 8.

Thus, I don't have a solution. The most compelling one was Nexiwave, but the delays are annoying and I don't think it would work out long term. Here's the source I used to interface with it:

#!/usr/bin/python

# Copyright 2012 Nexiwave Canada. All rights reserved.
# Nexiwave Canada PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.

import sys, os, json, urllib2, urllib, time

# You will need python-requests package. It makes things much easier.
import requests

# Change these:
# Login details:
USERNAME = "user@myemail.com"
PASSWORD = "XYZ"

def transcribe_audio_file(filename):
    """Transcribe an audio file using Nexiwave"""
    url = 'https://api.nexiwave.com/SpeechIndexing/file/storage/' + USERNAME +'/recording/?authData.passwd=' + PASSWORD + '&auto-redirect=true&response=application/json'

    # To receive transcript in plain text, instead of html format, comment this line out (for SMS, for example)
    url = url + '&transcriptFormat=html'


    # Ready to send:
    sys.stderr.write("Send audio for transcript with " + url + "\n")
    r = requests.post(url, files={'mediaFileData': open(filename,'rb')})
    data = r.json()
    transcript = data['text']
        
    # Perform your magic here:
    print "Transcript for "+filename+"=" + transcript


if __name__ == '__main__':
    # Change this to your own
    filename = "/data/audio/test.wav"
    
    transcribe_audio_file(filename)

I took this directly from their site and posted it here because it is hard to find, and I don't think they care if I advertise for them. All I did to make it work was sign up for a free account and enter my particulars in the fields at the top. It worked first try; the interface is simple and easy. It would be relatively easy to adapt this to a voice control system on my Pi if I decided to go that way, which I may do for control in the dark of my bedroom, where I don't want to search for a remote that may be behind the side table.
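The adapting part would mostly be deciding what to do with the text that comes back. Here is a rough Python 3 sketch of that step, assuming the transcript arrives as plain text (the comment in the listing above says commenting out the transcriptFormat line gives you that). The command table and the two handler functions are hypothetical placeholders; a real handler would talk to whatever actually controls the house.

```python
import re


def normalize(transcript):
    """Lowercase the transcript and drop punctuation and extra spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", transcript.lower()))


def porch_light_on():
    # Placeholder: a real handler would talk to the house controller here.
    return "Porch light is now on."


def porch_light_off():
    # Placeholder, same as above.
    return "Porch light is now off."


# Hypothetical command table: spoken phrase -> handler function.
COMMANDS = {
    "porch light on": porch_light_on,
    "porch light off": porch_light_off,
}


def dispatch(transcript):
    """Run the handler for a recognized phrase, or return None if unknown."""
    handler = COMMANDS.get(normalize(transcript))
    return handler() if handler else None
```

With this table, `dispatch("Porch light on")` runs the first handler, and anything that isn't in the table, like "Burn under the blight," is simply ignored. An exact-match table is the simplest thing that could work; it only makes sense when the recognizer is as accurate as the distributed ones were in my tests.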

The audio file I sent was my usual, "Porch light on," and it decoded it exactly on the first try. I tried a few others and they all worked equally well. Which brings up another item: sound on the Raspberry Pi. Frankly, unless you're dealing with digital files and streams, it sucks. There isn't enough filtering on the Pi to keep audio hum out of things. The amplified speakers I was using had a constant low-level hum (regular ol' 60 hertz hum), and it would get into the audio captured from the USB microphone as well. This might have been reduced by an expensive power supply with very good filtering, or maybe not; I didn't try.

To add insult to an already injurious process, ALSA (Advanced Linux Sound Architecture) is the single most confusing sound implementation I've ever seen. It was constructed by sound purists and technology students, so it is filled with special cases, odd syntax, devices that mostly work, etc. The documentation is full of 'try this'. What? I love experimenting, but I sort of like to have documentation that actually has information in it. PulseAudio is another possibility, but I'll approach that some other time, maybe a few weeks after hell freezes over; ALSA was bad enough. But if you're going to experiment with sound under Linux, you'll have to deal with ALSA at some point, especially if you actually want to turn the volume up or down.
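For the volume part at least, the usual route is the amixer command line tool that comes with ALSA. The Python 3 sketch below only wraps it; the function names are mine, and the control name "PCM" is an assumption that may not match your sound device (running `amixer scontrols` lists the names your card actually has).

```python
import subprocess


def amixer_volume_command(percent, control="PCM", card=None):
    """Build the amixer argument list that sets a mixer control to percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    cmd = ["amixer"]
    if card is not None:
        cmd += ["-c", str(card)]    # pick a specific sound card by number
    cmd += ["sset", control, "%d%%" % percent]
    return cmd


def set_volume(percent, control="PCM", card=None):
    """Run amixer; raises CalledProcessError if amixer reports failure."""
    subprocess.check_call(amixer_volume_command(percent, control, card))
```

Calling `set_volume(80)` would run `amixer sset PCM 80%`. Building the command separately from running it makes it easy to print and check what is about to be sent to ALSA before anything changes.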

I think I'm going to do some research on remote control ergonomics. There's got to be a cool and actually useful way to turn on the porch lights.

13 comments:

BobW, June 7, 2014 at 11:41 AM
You probably saw it, but if not, the following may be worth a quick read.

How to Upgrade Jasper’s Voice Recognition with AT&T’s Speech-to-Text API

Or if the above doesn't work:
http://hackaday.com/2014/06/07/how-to-upgrade-jaspers-voice-recognition-with-atts-speech-to-text-api/

I haven't personally tried it out, as I've all but given up on voice recognition. I still find it interesting, my first exploration of the technology started with HearSay II back in the 80's. Even the much touted Dragon on a contemporary desktop is marginal.

Unknown, February 21, 2016 at 1:23 PM
The Movi shield for the Arduino looks awesome; the way they programmed it makes more sense. You should take a look at it.

Unknown, November 12, 2016 at 7:24 PM
I'm interested in doing this as well, though my aim is to start by making an alarm with more interactivity than Android's native voice-controlled one. This is going to be my first coding project, and I'm also surprised by how little it seems to be in use.

You can actually do Google speech-to-text offline with an Android phone: https://9to5google.com/2016/03/11/google-accurate-offline-voice-recognition/
This includes phones with the same specs as RPi 3. It works for my Coolpad Catalyst (cheap phone).

Would it be possible to download it from the phone? Right now I'm thinking about simply using my phone as the microphone, much like this guy did for his LED table: https://www.youtube.com/watch?v=gihvvbNIEo8

Your thoughts on this would be appreciated. I'm interested in getting past this hurdle so I can move on to working on the STT text itself.

Frank Morris, April 26, 2018 at 12:52 AM
Nice post, really liked it. Speech-to-text software is the most advanced and great technology that has been discovered; it has really helped people understand different languages.

Horizonripper, June 27, 2019 at 4:05 PM
I wonder if the Home system could boot from the PC to a remote Pi and then be able to shuttle information from the Dragon Naturally/ Pc to the Pi3/4/remote? Is there an app for that yet?