Deconstructing Google Mobile's Voice Search on the iPhone

I’ve experimented with audio transcription lately, but always with big, clumsy humans. I’d happily hand the job over to speech recognition software, but even today, automatic voice-to-text conversion is still flawed. Naturally, I was intrigued when Google announced they were adding voice search to their Google Mobile iPhone app.

Google’s flirted with voice-to-text conversion in the past, with GOOG-411 and their Audio Indexing of political videos on YouTube. But this is the first time they’re offering a web-accessible interface for speech conversion, albeit completely undocumented, so I decided to poke around a bit to see what I could find.

Over the last few hours, I’ve been analyzing the traffic proxied through my network, trying to reverse-engineer it to get to something usable, but I’ve hit my limits. I’m posting this with the hopes that someone out there can run with it and find out more.

Behind the Scenes

Here’s what we know so far: When you first start speaking into the microphone, the app opens a connection to Google’s server and starts sending over chunks of audio, almost certainly encoded with the open-source Speex codec.
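If you want to produce comparable test clips yourself, the reference speexenc and speexdec tools will do it. Here’s a minimal sketch, assuming the Speex tools are installed and that query.wav is a short 16 kHz mono recording (the filenames are just placeholders):

import subprocess

# Encode a local recording with the open-source Speex encoder ("-q 8" is a
# mid-range quality setting), then decode it again to hear what survives.
subprocess.run(["speexenc", "-q", "8", "query.wav", "query.spx"], check=True)
subprocess.run(["speexdec", "query.spx", "roundtrip.wav"], check=True)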

The waveform image is generated on the phone and displayed along with a “Working” indicator and the adorable “beep-boop” sounds. In the background, a tiny file is being sent as a POST request to http://www.google.com/m/appreq/gmiphone. Here’s what the headers look like:

POST /m/appreq/gmiphone HTTP/1.1

User-Agent: Google/0.3.142.951 CFNetwork/339.3 Darwin/9.4.1

Content-Type: application/binary

Content-Length: 271

Accept: */*

Accept-Language: en-us

Accept-Encoding: gzip, deflate

Pragma: no-cache

Connection: keep-alive

Host: www.google.com

The response from Google is an even smaller attachment. These two files are the same for every query, so they don’t contain any meaningful information.

HTTP/1.1 200 OK

Content-Type: application/binary

Content-Disposition: attachment

Date: Tue, 18 Nov 2008 13:06:53 GMT

X-Content-Type-Options: nosniff

Expires: Tue, 18 Nov 2008 13:06:53 GMT

Cache-Control: private, max-age=0

Content-Length: 114

Server: GFE/1.3

After the audio’s sent to Google, they return an HTML page with the results, and a second request is triggered: this time a GET request to clients1.google.com with the converted voice-to-text string.

GET /complete/search?client=iphoneapp&hjson=t&types=t

&spell=t&nav=2&hl=en&q=chicken%20soup HTTP/1.1

User-Agent: Google/0.3.142.951 CFNetwork/339.3 Darwin/9.4.1

Accept: */*

Accept-Language: en-us

Accept-Encoding: gzip, deflate

Pragma: no-cache

Connection: keep-alive

Host: clients1.google.com

The response is an array of search terms in JSON format, for use in search autocompletion.

["chicken soup",[["http://www.chickensoup.com/","Chicken Soup for the Soul",5,""],["http://www.chickensoupforthepetloverssoul.com/","Chicken Soup for the Pet Lover's Soul",5,""],["chicken soup recipe","489,000 results",0,"2"],["chicken soup for the soul","1,470,000 results",0,"3"],["chicken soup dog food","462,000 results",0,"4"],["chicken soup with rice","467,000 results",0,"5"],["chicken soup diet","453,000 results",0,"6"],["chicken soup from scratch","364,000 results",0,"7"],["chicken soup for the soul quotes","398,000 results",0,"8"],["chicken soup crock pot","604,000 results",0,"9"]]]

Help!

Unfortunately, until we can isolate and decode the audio stream, playing with the voice recognition features is out of reach.

Any ideas on cracking this mystery would be hugely appreciated. Anonymity for Google insiders is guaranteed!

Updates

As several commenters figured out, and as Google confirmed to me, the audio is being sent to Google’s servers for voice recognition. The two binaries I posted above aren’t the actual transmission; they’re identical for every query, so they can be disregarded. Sorry about the red herring.

Gummi Hafsteinsson, product manager for Google’s Voice Search, says, “I can confirm that we split the audio down to a smaller byte stream, which is then sent to Google for recognition, but we can’t really provide any details beyond that.” Responding to my request for a public API, he added, “I appreciate the suggestion to provide voice recognition as a service. Right now we have nothing to announce, but we’ll take this feedback as we look at future product ideas.”

Also, Chris Messina discovered some secret settings in the application’s preferences file, including alternate color schemes and sound sets for “Monkey” and “Chicken.” Beep-boop!

Next step: As Paul discovered in the comments, the Legal Notices page says clearly that the app uses the open-source Speex codec for voice encoding. Can anyone capture and decode the audio being sent to Google?
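If you do capture it, one quick sanity check is to throw the bytes at the reference decoder. A rough sketch, with the caveat that speexdec expects an Ogg-wrapped .spx file, so a raw captured stream may need to be re-framed before it will decode (captured.spx is a hypothetical dump of the bytes pulled off the wire):

import subprocess

# Try the reference decoder on the captured payload; a non-zero return code
# or stderr output tells you the stream isn't a plain Ogg Speex file.
result = subprocess.run(
    ["speexdec", "captured.spx", "captured.wav"],
    capture_output=True, text=True,
)
print(result.returncode, result.stderr)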

November 19: I rewrote most of this entry to reflect the new information, since it was confusing new readers.

Comments

    I have to agree with JB that the request is probably a set of phonemes. The request seems to contain a header, and then 7 “g:log:ev” records. This matches the count of phonemes in the phrase “chicken soup”: tʃʰ ɪ k n s u p.

    If you started submitting words or phrases with similar sounds in them, you might be able to build a sort of lookup table that maps each phoneme to Google’s binary representation. However, if you were hoping to use their service, I’m afraid it looks like the speech-to-text is all done in the app itself.

    Also, I don’t think that the binary response you posted contains the actual response. Looking at it in a hex editor shows pretty clearly that it could not possibly contain the words “chicken soup”. It looks more like a counter or something.

    –Brandon

    I agree with jb. There is little chance that they compressed the actual waveform into 100-300 bytes. Instead, they must have done some preprocessing on the iPhone itself. Like jb suggested, they most likely broke it down into phonemes and sent just enough data to describe that, even though the server could never reconstruct the actual waveform from it. Makes sense to me.

    The actual conversion from phonemes to words is then left up to the servers, where they can update dictionaries and whatnot.

    I guess the moral of the story is, if you are trying to reverse engineer it, expect things other than actual waveforms. Maybe try saying “OOOH” then “AAAH” and see what happens. Then go for simple syllables “PAH”, “KAH”, “MAH”. See how consistent it is… personally, I really would not spend too much time on it, unless you are *really* into it. 🙂

    Good luck!

    Clearly, it’s something magical going on behind the scenes. I would hazard a guess at Captain Samuel Hogberts: Phenomic Bi-Transcendental Audio Scribeographer.

    Without looking at it, my guess would be very-low-bandwidth Speex encoding. But the phoneme theory is easy to test — just say the same thing a bunch of times and see if the same bits ever get sent twice.

    I would suggest it is an acoustic fingerprint that gets transmitted and Google’s databases convert that into the correct term(s), drawing on their most popular search terms (for specific locales).

    I would guess the data sent up to Google is a ‘feature extracted’ representation of the source audio. This is a typical early step in ASR. The waveform is segmented into frames, then a series of observation vectors (Mel Frequency Cepstrum, Acceleration, etc) are calculated. The stream of these observations can be much smaller than the source audio.
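    For a feel of how compact that kind of representation is, here is a minimal sketch of MFCC extraction using the librosa library (my choice for illustration; nothing confirms Google does it this way, and query.wav is a placeholder):

    import librosa

    # Load a short recording and compute 13 MFCCs per frame, plus the delta
    # and delta-delta features mentioned above.
    y, sr = librosa.load("query.wav", sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)

    print("frames:", mfcc.shape[1], "features per frame:", 3 * mfcc.shape[0])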

    At first, phoneme evaluation seemed like the obvious approach. I’m curious, though, because I know from using it that GOOG-411 handles voice recognition on the server side over generally lousy cell phone connections. Along the same lines as Aaron’s suggestion, if you aggressively clamped down on the frequency range and used something like Speex, is it plausible to get a two or three word phrase down to 300 bytes? I don’t really know, but it doesn’t seem unreasonable.

    The response to the first request probably contains a list of possible phrases, maybe in a compressed form. If you tap on the green search term, you get a menu of other results.

    Is it possible Google doesn’t bother with (oldschool) phonemes at any stage? I can imagine they might have a superset of ~1000 googlephonemes… but if the phone can compress three words down to 200 chars, it _could_ do oldschool phonemes, if it wanted to.

    300 chars could also mean oldschool phonemes, plus additional acoustic-fingerprint-style codes?

    Can’t help you, but Google should offer an official API for this. They could power the voice interface for every mobile device / app… and in the process, collect tons of valuable data on how people search & interact with applications outside of their own products.

    While the request file is certainly mysterious, the response is even stranger. After a two-byte header (0x0001) the thing is a sequence of seven sixteen-byte lines:

    0x0000000A8100000000C8000000000000

    0x0000000A8100000100C8000000000000

    0x0000000A8100000200C8000000000000

    .

    .

    .

    the only thing that changes is the eighth byte keeps getting incremented by 1. This is hardly enough information to spell out “chicken soup” in any possible way. It is also not any compression scheme I can possibly imagine.

    I would suggest that this may be confirming the phone software’s guess. I wonder if the response would be the same every time for the same phrase? Or what if the word sent was something between “mode” and “moat” – /məʊd/ or /məʊt/. This may cause Google to send slightly different phonemes every time.

    You can also try something like afsadafsa – pronounced /æfsəˈdæfsʌ/ or so — which should be easy enough to break into phonemes, but would yield a “did you mean?” suggestion from Google.
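    A quick way to eyeball that record structure is to parse the lines quoted above directly; a minimal sketch:

    # The response body described above: a 2-byte header, then 16-byte records
    # where only the eighth byte appears to change.
    records = [
        bytes.fromhex("0000000A8100000000C8000000000000"),
        bytes.fromhex("0000000A8100000100C8000000000000"),
        bytes.fromhex("0000000A8100000200C8000000000000"),
    ]
    for i, rec in enumerate(records):
        print(i, rec[7])   # prints the incrementing counter byte: 0, 1, 2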

    I agree with Ilya Haykinson: the first response does not seem to contain enough information to spell out “chicken soup”.

    I would imagine the speech-to-text is performed on the iPhone itself, and this triggers the 2nd GET request with the query text.

    The encoded speech is probably sent to Google for additional analysis, to continue what they started with GOOG-411.

    From Garett Rogers’ ZDNet article: “In reality, Google’s 411 service is about training a powerful speech-to-text engine that will one day find itself in things like video search. The more sample audio they have (people looking for businesses), the more accurate their system will become.” [http://blogs.zdnet.com/Google/?p=852]

    It’s got to be a first layer HMM on the phone. I’ve been waiting for Google to do this for years. You don’t need to know the semantics of the speech at all. Just match the state trajectory of all those search phrases. You can use a TTS engine on all that nice Google text search data to make a rough matching set to pattern match against.

    It’s not so much the phonemes, as the transitions between phonemes. That trajectory can be encoded very efficiently.
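    Just to illustrate what a state trajectory looks like, here is a minimal sketch using the hmmlearn library (purely my own choice; there is no evidence Google uses anything like it): fit a small Gaussian HMM to a matrix of feature frames and read off one state index per frame.

    import numpy as np
    from hmmlearn import hmm

    # X stands in for a (frames x features) matrix of MFCC vectors; random
    # data is used here only to keep the sketch self-contained.
    X = np.random.randn(200, 39)

    model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
    model.fit(X)

    # The "state trajectory": one HMM state per frame. Runs of repeated states
    # compress well, which is the point being made above.
    print(model.predict(X)[:40])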

    I just did several more tests with different phonemes and long phrases, and the binary request and response attachments are identical. No difference at all.

    This leaves only two possibilities:

    1. All voice recognition is happening on the iPhone itself, with the little 2MB iPhone app, and without sending anything to Google’s servers.

    2. Somehow, I’m missing some key transmission in both my proxy debugger and Wireshark.

    The first option simply makes no sense to me, so I’m going to assume that the second one is right. Can anyone try to repeat my network analysis and post your results?
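    If you want to repeat the test, the quickest check is to hash each captured payload and see whether two takes of the same phrase ever produce identical bytes. A minimal sketch, assuming you’ve saved each capture to its own file:

    import hashlib
    import sys

    # Usage (filenames hypothetical): python hash_captures.py take1.bin take2.bin
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(hashlib.sha1(f.read()).hexdigest(), path)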

    I just sniffed my network while the iPhone was doing the voice search, and I noticed a ton of traffic going to a Google server on port 19294. When I was speaking into the iPhone microphone, the number of packets transmitted over that TCP connection increased substantially. If you were just sniffing port 80 or looking for “POST” you would have missed it. Even when you’re not talking into the mic, it’s constantly sending about 32 bytes of data back and forth between the phone and the server.

    Looking inside the packets, I noticed a couple of interesting things. First, this part comes from the phone:

    
    http://www.google.com:80/m/voice.x..
    
    .hl..en..
    
    .gl..us..
    
    .v..0.3.142.951..
    
    .ie..UTF-8..
    
    

    (that’s part of a direct dump from wireshark)

    After a whole bunch of binary data is exchanged, the server sends down a full HTML document, over this TCP connection. Inside this HTML document are the search terms.
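    For anyone who wants to watch that connection themselves, here is a minimal sketch using scapy (run as root; the port number comes from the comment above, and scapy is just my tool of choice):

    from scapy.all import Raw, sniff

    # Print the size of every TCP payload to or from port 19294, so you can
    # see the burst of traffic while you're speaking into the microphone.
    def show(pkt):
        if pkt.haslayer(Raw):
            print(len(pkt[Raw].load), "bytes")

    sniff(filter="tcp port 19294", prn=show, store=False)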

    It’s probably not of much interest, but I wonder whether, when you’ve turned on voice search and you’re not speaking, the silence.wav file is being transmitted. It’s only 124 bytes and can be found in Google 0.3.142.ipa/Payload/Google/, along with some other WAV files.

    I also found references to Sony Sound Forge, but I also saw references to ImageReady and Fireworks. Not that interesting.

    There’s also a curious localization string: “Search only works in English, and works best for North American English accents.”

    And, in the Preferences.plist file, there are loads of goodies:

    http://www.pastie.org/318386

    Check out the array starting at #323: this is the array for debugging the app! There are references to a “kGMOPrefVoiceSearchServer” option (342) and to “kGMOPrefLogUtterances” (352), which is an option to “Log Utterances to Disk”.

    The next array, at 376, is the “Bells and Whistles” set of options, referred to as “kGMOPrefGroupSecretSettings”. Here you get to change your color theme (391), pick a soundset (396): Default, Monkey or Chicken, turn on the “Live Waveform” (413), or “Open Links in App” (423).

    Pretty neat, though might not help you much!
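    If you want to poke through those preferences yourself, the .ipa is just a zip archive, and Python’s plistlib will read the plist once you’ve extracted it. A minimal sketch; the path is a guess based on the comment above:

    import plistlib

    # Hypothetical path inside the extracted .ipa; adjust to wherever the
    # Preferences.plist actually lives in your copy.
    with open("Payload/Google/Preferences.plist", "rb") as f:
        prefs = plistlib.load(f)

    # Crude check for the debug and "Bells and Whistles" keys mentioned above.
    for key in ("kGMOPrefVoiceSearchServer", "kGMOPrefLogUtterances",
                "kGMOPrefGroupSecretSettings"):
        print(key, "present:", key in str(prefs))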

    Any chance the 300 bytes are MFCC vectors, as used in HMM-based speech recognition? See, for example, the Aurora project from ETSI a few years back. Spectral processing happens in the phone, which has to do something like this anyway to make mobile-phone speech compression work, followed by transmission of the MFCC vectors to a server that does the heavy lifting of the tree search needed for recognition. The numbered lists might be some kind of n-best list of recognition candidates.

    Reading the comments gives two possibilities:

    1) The 300-bytes-for-3-words theory is true

    It takes about 1 second to utter 2-3 words, and at 16 kHz sampling that is 16k samples. Generally frames are 10-20 ms and half-overlapped, which gives 100-200 frames. Each frame yields 13 MFCC features, 13 deltas, and 13 double-deltas: 39 float/double values in total. Now they can use a very high-dimensional vector quantization codebook to compress it. Typical sizes are 256-1024, but those degrade performance, so Google could go up to a 32K-entry codebook to minimize quantization noise. That gives 16 bits, or 2 bytes, per frame, or 200-400 bytes for the spoken phrase (see the arithmetic sketch below).

    2) As Matt observed, Andy hypothesised, and I also feel, they must be doing the speech recognition on their sturdy data centers, using more complex models trained on tons of GOOG-411 samples. That would require them to send the features in float-compressed form, or even the speech samples themselves. That would be quite a lot of data.
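    To sanity-check the arithmetic in option 1, here is a quick worked sketch; the frame and codebook sizes are the commenter’s assumptions, not anything Google has confirmed:

    import math

    utterance_sec = 1.0        # roughly one second for a 2-3 word query
    frame_shift_ms = 10        # 10-20 ms frames, half-overlapped
    frames = int(utterance_sec * 1000 / frame_shift_ms)       # ~100-200 frames

    codebook_size = 32 * 1024  # a 32K-entry VQ codebook
    bits_per_frame = math.ceil(math.log2(codebook_size))      # 15 bits

    total_bytes = frames * bits_per_frame / 8
    print(frames, "frames,", bits_per_frame, "bits/frame,", total_bytes, "bytes")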

    With only a few hundred bytes per utterance, they are probably doing vector quantization of MFCCs on the phone itself. Sending the floats would take too much room. There is really no way they could be sending actual speech samples in only 300 bytes. One of the earlier posts said that there were other ports active during transmission, so it is possible that the entire 32KB of a one-second utterance is sent that way, and that all processing really is done at the data center.

    A good part of the trick to these kinds of applications is to restrict what the recognizer actually has to do, and then make it look smarter than it is through a clever user interface. Google may have a specialized grammar that expects queries of just a few words, in a more or less standard format that people who have used Google before intuitively use. It would be interesting to see if speaking the query as a longer natural language utterance would make things fall over, or at least increase failure rate (mis-recognition). A longer utterance could also be used to probe the maximum utterance length they allow. Maybe 2-3 seconds or so? There is apparently a form of pacing, where the recognition window opens with a beep.

    It’s all very general, but at first glance, it seems they may in fact be transmitting the entire utterance to a remote server for processing, possibly on some odd port. i.e., the phone does almost no processing. If this is so, they could presumably have this service for any device with speech capture and a network connection, not just iPhones – PCs, laptops, etc. A new toolbar in your browser and away you go. All sorts of possibilities.

    United States Patent Application 20080243501

    October 2, 2008

    Location-Based Responses to Telephone Requests

    Abstract

    A method for receiving processed information at a remote device is described. The method includes transmitting from the remote device a verbal request to a first information provider and receiving a digital message from the first information provider in response to the transmitted verbal request. The digital message includes a symbolic representation indicator associated with a symbolic representation of the verbal request and data used to control an application. The method also includes transmitting, using the application, the symbolic representation indicator to a second information provider for generating results to be displayed on the remote device.

    You will find the probable explanation of the lists of numbers here. Most likely, they are the n-grams described in the body of this patent application, arranged in order of probability of occurrence.

    One prospect hinted at – this is apparently also how they do language translation for Web pages. Now they have speech-enabled that process…

    United States Patent Application 20080262828

    October 23, 2008

    Encoding and Adaptive, Scalable Accessing of Distributed Models

    Abstract

    Systems, methods, and apparatus for accessing distributed models in automated machine processing, including using large language models in machine translation, speech recognition and other applications.

    Steven is right, I believe. Even recording at 8 kHz with good compression, it would seem impossible to compress 1-2 seconds’ worth of audio (best case, BTW) into 300 bytes or less, IMHO.

    One way to test is to use this over EDGE. Sure, it would be roundabout, but if they are sending something larger than 300 bytes over some other port, the lag will be quite noticeable when switching over from WiFi… My initial tests seem to point in this direction.

    Why not just POST over port 80 I wonder..?

    Although there is a lot of buzz about iPhone and this new Google voice app, I saw a version of this capability five years ago. It was running on a Windows handheld device that talked to a central server over the Web for recognition and semantic processing – spoken Web queries. As I recall, the handheld did some front end processing and sent vector-quantized MFCCs back to the server, encapsulated in some clever HTML response. Google appears to be using raw digitized audio sent over some obscure port, with their servers doing all the work.

    iPhone is mightier than iPaq, I guess.

    United States Patent 7,050,977

    Bennett May 23, 2006

    Speech-enabled server for internet website and method

    Abstract

    An Internet-based server with speech support for enhanced interactivity is disclosed. This server hosts a server-side speech recognition engine and additional linguistic and database functions that cooperate to provide enhanced interactivity for clients, so that their browsing experience is more satisfying, efficient and productive. This human-like interactivity, which allows the user to ask queries about topics ranging from customer delivery to product descriptions and payment details, is facilitated by allowing the user to articulate his or her questions directly in his or her natural language. The answer, typically provided in real time, can also be interfaced and integrated with existing telephone, e-mail and other mixed media services to provide a single point of interactivity for the user when browsing a web site.

    If you look at the About, you’ll see they reference Speex. It seems pretty clear this is using chunked Speex encoding, passed up to their server.

    When I first saw this article (prior to the update) I said “No way – there’s no way to do all that locally, without a massive database”, so I did much of the same poking around, and found the audio transmissions. I haven’t analyzed them much, but Speex can be found mentioned in the Legal documents.

    From what I read Flash 10 has a Speex codec. That would make a desktop client for Google Voice Search possible, wouldn’t it? O, Lazyweb, I urge you…

    Just for kicks, I decided to try reading some sentences from the comments into the Google Mobile app, and they’ve definitely simplified the grammar and coupled it with the most probable search results in some fashion. As a general, understand-everything speech-to-text engine it looks rather crippled. As an option to initiate Google searches from your desktop via a mic, I think it would be amazing. When reading the sentences, you can see that only the larger words are picked out, which matches what you do with Google searches: no “a, I, the, of, etc.”, just the larger, more common words that make sense as often-searched terms. I would LOVE a desktop app to do this: press a shortcut key, say something, and have a browser launch with search results, or even search through my Gmail account. That would truly change how I interact with my computer. My only question is whether they’ll bless Linux with a copy 😀

    This thread is a year and a half old now; any developments? I too want to be able to use Google voice search on my desktop PC.

    I would like to use my own URL instead of using Google search, news, images, or Wikipedia.

    Do you know if there is a way to change a config file somewhere in order to add a new web site, in addition to the ones listed above?

    Basically, once my voice is recognised, I would like to run something like http://www.mywebsite.com/?sentence=%1

    assuming that %1 will be automatically replaced by the recognised text.

    I already investigated changing the hosts file (/etc/hosts) in order to redirect each Google call to my own IP. The problem is that you can only redirect a hostname to an IP, and in most cases the main domain is resolved, not the subdomain. For example, I would like to use jGate (http://apps.jgate.de/) to host my RESTful API. Even when you find the IP associated with your application (i.e. http://mywebsite.jgate.de/), it is jgate.de that resolves, so I can’t add that IP address to my hosts file.

    Any idea how to use our own engine?

    Thanks in advance for your help.

    Johnny B (from france)

Comments are closed.