Showing posts with label Voice. Show all posts

Tuesday, May 27, 2014

Voice Recognition on the Raspberry Pi - Reality Check

My recent experience with voice control on the Pi got me to thinking.  Why wasn't this a rising star and constantly being talked about?  The idea of talking to your house is so compelling that there must be hundreds of implementations out there.

Well, there are, and none of them work very well.

I described my experience with Jasper; it just didn't live up to the hype.  So I went looking and did some experimenting.  Everyone talks about how good Siri is.  My experience with it is far less than stellar; every phone I've tried it on misunderstands me about 6 times out of 10.  Google's implementation seems to work better, and I get about an 80% success rate.  Both of these are stellar compared to the several software techniques I tried out, with the absolute worst being CMU Sphinx, which Jasper is based on.

Remember, I'm looking at this as a way to control my house with a little computer, not dictate letters wearing a headset, so let me talk a bit about methods.  No, I'm not going to bore the heck out of you with a dissertation on the theories of voice recognition.  I want what everyone else wants: I want it to work.  There are basically two methods of doing speech recognition right now: local and distributed.  By local I mean totally on one machine; distributed is when the sound is sent over the internet and decoded somewhere else.  Google's voice API is an example of distributed and CMU Sphinx is an example of local.

What we all want is for it to operate like Star Trek:

"Computer."

Nice clear beep

"Turn on the porch lights"

Nice clear acknowledgement, and maybe a, "Porch light is now on."

I went through the entire process of bringing up CMU Sphinx <link>, and when I tried it, I saw something on the order of, "Burn under the blight."  To be fair, Sphinx can be trained and its accuracy will shoot way up, but that takes considerable effort and time.  The default recognition files just don't cut it, especially when the same phrase gave me 100%, yes, totally accurate results with Google's voice interface.  The problem with Google's interface is that it only works in the Chrome browser.  Yes, there are tools out there that use the Google voice API, notably VoiceCommand by Steve Hickson <link>, but expect them to quit working soon.  Google ended their offering of version 2 of the interface, and version 3 limits how many requests you can make and requires a special key.  Thus will end a really cool possibility; I hope they bring it back soon.

So, the local possibilities are inaccurate and the distributed ones are accurate, but the one everyone was using is likely to disappear.  There are other distributed solutions; I brought up code taken from Nexiwave <link> and tested it.  There was darn near a 100% success rate.  The problem was delay.  Since I was using a free account, I was shuffled to the bottom of the queue (correctly and expectedly), so the response took maybe three seconds to come back.  Now, three seconds seems like a small price to pay, but try it out with a watch to see how uncomfortable that feels in real use.  It's not that Nexiwave is slow; it's that the doggone internet takes time to send data and get back a response.  I didn't open a paid account to see if it was any better; this was just an experiment.
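
For anyone who would rather measure the delay than count it off on a watch, here is a minimal sketch of timing one round trip.  The `timed` helper and the `fake_transcribe` function are names made up for this illustration; in real use you would pass in whatever function sends the audio off and waits for the transcript.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_transcribe(filename, delay=0.05):
    """Hypothetical stand-in for a network transcription call."""
    time.sleep(delay)  # pretend to wait on the network
    return "porch light on"

if __name__ == "__main__":
    text, seconds = timed(fake_transcribe, "/data/audio/test.wav")
    print("Got %r after %.2f seconds" % (text, seconds))
```

Running it a dozen times and looking at the spread would say more than a single reading, since network delay varies from one request to the next.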

But, think about it a bit.  "Computer,"  one thousand and one, one thousand and two, one thousand and three, "Yes".  Then the command, "Turn on the porch light", etc.  It would be cool and fun to show off, but do you really want to do it that way?  Plus it would require that the software run continuously to catch the occasional, "Computer" command initiation.  Be real, if you're going to have to push a button to start a command sequence, you might as well push a button to do the entire action.  Remember, you have to have a command initiator or something like, "Hey Jeff, get your hand out of the garbage disposal, it could turn on," could be a disaster.  A button somewhere labeled, "Garbage Disposal," would be much simpler and safer.
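
The command initiator idea can be sketched in a few lines.  This is only an illustration and assumes something else has already turned speech into text; the `CommandGate` class, the wake word "computer", and the five second window are choices made for the example, not part of any of the tools mentioned here.

```python
import time

class CommandGate:
    """Pass along a phrase only if it follows the wake word in time."""

    def __init__(self, wake_word="computer", window=5.0, clock=time.monotonic):
        self.wake_word = wake_word
        self.window = window      # seconds the gate stays open
        self.clock = clock        # injectable so it can be tested
        self.armed_at = None      # when the wake word was last heard

    def hear(self, phrase):
        """Return the command to act on, or None if it should be ignored."""
        text = phrase.strip().lower().rstrip(".!?,")
        now = self.clock()
        if text == self.wake_word:
            self.armed_at = now
            return None
        armed = self.armed_at is not None and now - self.armed_at <= self.window
        self.armed_at = None      # one command per wake word
        return text if armed else None

if __name__ == "__main__":
    gate = CommandGate()
    for phrase in ["Turn on the porch lights", "Computer.", "Turn on the porch lights"]:
        print(repr(phrase), "->", gate.hear(phrase))
```

The first "Turn on the porch lights" is ignored because nobody said "Computer" first; the second one gets through.  Anything said without the wake word, garbage disposal warnings included, never reaches the part of the program that flips switches.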

Don't talk to me about Dragon NaturallySpeaking from Nuance <link>.  That tool is just unbelievable.  It is capable of taking dictation at full speed with totally amazing accuracy, but it only runs on machines much larger than a Pi, and not at all under Linux.  Even their development version is constructed for Windows server machines.  Microsoft has a good speech recognition system built right into the OS, and under Windows 8 it is incredible, especially at no additional cost at all.  But there aren't many Raspberry Pi machines running Windows 8.

Thus, I don't have a solution.  The most compelling one was Nexiwave, but the delays are annoying and I don't think it would work out long term.  Here's the source I used to interface with it:

#!/usr/bin/python

# Copyright 2012 Nexiwave Canada. All rights reserved.
# Nexiwave Canada PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.

import sys

# You will need python-requests package. It makes things much easier.
import requests

# Change these:
# Login details:
USERNAME = "user@myemail.com"
PASSWORD = "XYZ"

def transcribe_audio_file(filename):
    """Transcribe an audio file using Nexiwave"""
    url = 'https://api.nexiwave.com/SpeechIndexing/file/storage/' + USERNAME +'/recording/?authData.passwd=' + PASSWORD + '&auto-redirect=true&response=application/json'

    # To receive transcript in plain text, instead of html format, comment this line out (for SMS, for example)
    url = url + '&transcriptFormat=html'


    # Ready to send:
    sys.stderr.write("Send audio for transcript with " + url + "\n")
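    # Open the audio inside a with-block so the file is closed after the upload.
    with open(filename, 'rb') as audio:
        r = requests.post(url, files={'mediaFileData': audio})
    data = r.json()
    transcript = data['text']

    # Perform your magic here:
    print("Transcript for " + filename + "=" + transcript)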


if __name__ == '__main__':
    # Change this to your own
    filename = "/data/audio/test.wav"
    
    transcribe_audio_file(filename)

I took this directly from their site and posted it here because it is hard to find, and I don't think they care if I advertise for them.  All I did to make it work was to sign up for a free account and enter my particulars in the fields up at the top.  It worked first try; simple and easy interface.  It would be relatively easy to adapt this to a voice control system on my Pi if I decided to go that way.  I may do that for control in the dark of my bedroom, where I don't want to search for a remote that may be behind the side table.
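
As a rough idea of what that adaptation might look like, here is a minimal sketch that turns a transcript string into a device and an action.  The `DEVICES` table and the `parse_command` function are hypothetical; nothing here comes from Nexiwave, and actually switching anything is left out.

```python
# Hypothetical table of things in the house and the words that refer to them.
DEVICES = {
    "porch light": ["porch light", "porch lights"],
    "garbage disposal": ["garbage disposal"],
}

def parse_command(transcript):
    """Return (device, action) found in the transcript, or None."""
    # Lowercase, drop commas and periods, and collapse runs of spaces.
    text = " ".join(transcript.lower().replace(",", " ").replace(".", " ").split())
    words = text.split()
    # Require exactly one of "on" / "off" as a whole word.
    actions = [w for w in words if w in ("on", "off")]
    if len(set(actions)) != 1:
        return None
    for device, names in DEVICES.items():
        if any(name in text for name in names):
            return device, actions[0]
    return None
```

With this table, "Porch light on" comes back as the porch light and "on", while "Burn under the blight" comes back as nothing at all.  Note that the garbage disposal warning from earlier would match the garbage disposal and "on", which is exactly why matching keywords alone isn't enough without a command initiator in front of it.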

The audio file I sent was my usual, "Porch light on," and it decoded it exactly first try.  I tried a few others and they all worked equally well.  Which brings up another item: sound on the Raspberry Pi.  Frankly, unless you're dealing with digital files and streams, it sucks.  There isn't enough filtering on the Pi to keep audio hum out of things.  The amplified speakers I was using had a constant low level hum (regular ol' 60 hertz hum), and it would get into the audio captured from the USB microphone as well.  This could have been reduced by an expensive power supply with very good filtering, or maybe not; I didn't try.

To add insult to an already injurious process, ALSA (Advanced Linux Sound Architecture) is the single most confusing sound implementation I've ever seen.  It was constructed by sound purists and technology students, so it is filled with special cases, odd syntax, devices that mostly work, etc.  The documentation is full of 'try this'.  What?  I love experimenting, but I sort of like to have documentation that actually has information in it.  PulseAudio is another possibility, but I'll approach that some other time.  Maybe a few weeks after hell freezes over; ALSA was bad enough.  But if you're going to experiment with sound under Linux, you'll have to deal with ALSA at some point.  Especially if you actually want to turn the volume up or down.

I think I'm going to do some research on remote control ergonomics.  There's got to be a cool and actually useful way to turn on the porch lights.

Sunday, May 25, 2014

Jasper for voice commands

I'm rapidly getting to the point that I need to have a simple remote control that I can use around the house.  I've thought about several possibilities, but I ran across Jasper <link> and that would be so cool.  A house that I can talk to.

On their site, they have a great video of the microphone sitting a few meters away and the authors gleefully giving it commands and listening to the responses.  I could make a few of these and scatter them around the house to control things by voice; that would really be nice.  Very slick site and seemingly well documented.

So, looking at the directions, they have a pretty complex system that installs voice recognition and speech capabilities on the Pi.  I decided to try it out.  Since their image file is most likely configured differently than what I use, I decided to do the full installation ... DON'T DO THIS.  Their instructions leave a lot of tiny little things out, like what directory to be in when you do things, how long the various steps can take, how freaking big it is.  Y'know things like that.

After 26 hours of installs, updates, compiles (some of which failed) and a whole lot of head scratching, I just gave up.  I still wanted to try it out, so I downloaded their disk image and installed it on my Pi.  After spending quite a while messing around, I still couldn't get it to respond to a command or hear what it was saying; obviously there was something wrong with the audio setup.  The authors used ALSA (google it), so I started digging into it to see how to test the audio.  After changing the terminal settings in PuTTY several times, I discovered that my alsamixer settings were all set to the minimum.  After jacking up the input gain and the output volume, I managed to get sound into and out of Jasper.
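
If fighting alsamixer through a terminal window isn't working out, the same levels can be set from a script with the amixer command line tool.  This is a minimal sketch, not what I actually ran; the `amixer_command` and `set_level` helpers are made up for the example, and the control names and card numbers are guesses that will differ from one sound device to the next.

```python
import subprocess

def amixer_command(control, percent, card=0):
    """Build the amixer argument list that sets a mixer control to a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return ["amixer", "-c", str(card), "sset", control, "%d%%" % percent]

def set_level(control, percent, card=0):
    """Run amixer; raises CalledProcessError if amixer reports a failure."""
    subprocess.run(amixer_command(control, percent, card), check=True)
```

Calling set_level("PCM", 90) would be the same as typing amixer -c 0 sset PCM 90% by hand.  "PCM" for output and "Mic" for a USB microphone are common control names but are assumptions here; amixer -c 0 scontrols lists the ones a given card really has.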

Then the disappointments really began.  It couldn't recognize my voice most of the time, and when it did manage to get it, it was mostly wrong.  "What is the meaning of life," one of the built in commands, would check my mail.  Interesting, but not what I expected.  It would do the time pretty well, but most everything else would cause an exception in the Python script that was running.  Sometimes it would fail so badly the Pi actually crashed.  Nothing has done that to me before.  Remember, this is a disk image; I didn't change anything, and it failed right out of the box.  The voice synthesis was really, really hard to understand, and playing it over and over again didn't seem to help.  My ears just weren't training to the odd sound.  Sort of like the Pi had a really bad cold and couldn't quite get the words out.

I didn't expect it to work like the movies, but c'mon, it should at least be as good as the videos they produced to show it off.  I even used the exact same microphone they show on their site and some really good powered speakers.  I can't blame the hardware at all.

So, I know there are a number of folk out there that installed this and got it to work.  There always are, but obviously they didn't build it on their Pi, because that simply doesn't work, and not many people have days to waste compiling thousands of source files.  So they must have used the image, but what the heck did they do to get it to understand them?  Similarly, how the heck did they understand what was being said by the software?  A series of beeps would have been better.

A few hours spent looking at various things on the web didn't reveal any secrets to make this work.  Instead, I found a significant number of people that hadn't been able to make it perform.  There didn't seem to be any solutions either.  There were a lot of, "I think," or "Maybe you can try," and "Have you tried;" but I discount most of those since they are simply guesses.  I didn't find anyone that was bragging about their success; that is very telling.  I'm not willing to dig into it because the vagaries of ALSA are daunting enough without having to delve into the secret and mysterious world of PocketSphinx (the recognition system they use).  I'm not interested in a new career.

I'm burning my original software back on the Pi's card right now so I can try something else.  For you folk out there that want to try it, I certainly hope you have better luck than I did.