Fluffy Bunny Burrows Into WinSock
From CommerceNet Wiki
by Rohit Khare and Kragen Sitaker
(We made some detailed notes on Deciphering Fluffy Bunny that back up the following discussion.)
Google Desktop Search quite reasonably installs itself as a webserver running on localhost. That is, if you point your browser to the special address http://127.0.0.1/, it displays a web interface served up by the Google code running on your desktop, not from Google Inc's data centers. That certainly seems straightforward enough.
The really jaw-dropping hack is that the same desktop results are seamlessly merged in when you __do__ aim a query at Google Inc's data centers. You've probably already heard about this nifty feature, since that finally hooks desktop search to a revenue stream for Google, something that's eluded the $99 indexer startups. How did they do this without violating users' privacy?
My first guess would have been that Google's server farm just sends pages that link to JavaScript loaded from localhost. Just as you can go to one website and see images loaded from another, your browser is perfectly willing to load a SCRIPT SRC= from localhost that prints out the current results.
That would be a fairly "loosely-coupled" approach: Google doesn't have to modify your browser or operating system, it just takes advantage of an established mechanism for letting separate web servers do their own jobs.
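To make that hypothetical concrete, here is a minimal sketch of what the server side of such a loosely-coupled design might look like. It is not anything Google ships: the snippet.js path and the q parameter are names we made up for illustration, and only the 127.0.0.1 address comes from the discussion above.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical endpoint on the local desktop-search web server.
# The path and the "q" parameter are assumptions for illustration only.
LOCAL_SNIPPET_URL = "http://127.0.0.1/snippet.js"


def results_page(query: str, remote_results_html: str) -> str:
    """Build a results page that leaves the local lookup to the browser.

    The remote server never sees local results: it only emits a SCRIPT SRC=
    pointing at localhost, and the browser fetches that script on its own.
    """
    script_src = LOCAL_SNIPPET_URL + "?" + urlencode({"q": query})
    return (
        "<html><body>\n"
        f"<h1>Results for {escape(query)}</h1>\n"
        # Loaded from the user's own machine, like an image from another site.
        f'<script src="{escape(script_src, quote=True)}"></script>\n'
        f"{remote_results_html}\n"
        "</body></html>"
    )


if __name__ == "__main__":
    print(results_page("tax forms", "<ol><li>remote hit</li></ol>"))
```

Nothing in this design touches the browser or the operating system; the only coupling is the agreed-upon URL on localhost.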
Instead, we found that Google Desktop Search actually hooks into Windows' TCP/IP stack and directly modifies incoming traffic from Google's websites to splice in its local results. Once you install GDS, there's a bit of Google's code running inside every Windows application that talks to the Internet.
It's done using a long-established hook in WinSock2, the Layered Transport Service Provider Interface (SPI). You can read more about it in a 1999 article from Microsoft Systems Journal, "Unraveling the Mysteries of Writing a Winsock 2 Layered Service Provider."
They appear to be fairly careful about it: their special-purpose code seems to look for particular HTML comments hidden in query result pages from Google Inc's data centers (there's a long list of their domain names in the binary), and it acts only if the calling application is one of a few browsers (AOL, MSN, NeoPlanet, Opera, AvantGo, Firefox, Netscape, and of course, MSIE).
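The gating logic described above can be sketched roughly as follows. This is our guess at the shape of the check, written in Python for readability rather than as a WinSock provider DLL. The host list, executable names, and marker comment are all placeholders we invented, not values taken from the GDS binary.

```python
# Illustrative placeholders only; none of these values come from the GDS binary.
GOOGLE_HOSTS = {"www.google.com", "google.com"}
BROWSER_EXES = {"iexplore.exe", "firefox.exe", "opera.exe", "netscape.exe"}
MARKER = b"<!--local-results-->"  # hypothetical placeholder comment


def should_rewrite(process_name: str, host: str) -> bool:
    """Only touch traffic that a known browser fetched from a known host."""
    return process_name.lower() in BROWSER_EXES and host.lower() in GOOGLE_HOSTS


def splice(body: bytes, process_name: str, host: str, local_html: bytes) -> bytes:
    """Return the response body, with local results spliced in when allowed."""
    if not should_rewrite(process_name, host):
        return body  # some other application or site: pass through untouched
    if MARKER not in body:
        return body  # not a result page carrying the marker comment
    # Replace only the first marker so the page gains one block of local hits.
    return body.replace(MARKER, local_html, 1)
```

A real layered provider has a harder job than this: it sees the response as a stream of arbitrary chunks rather than one complete body, so the marker can straddle two reads, and any change in length has to stay consistent with the HTTP framing.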
Still, running inside another application's address space is a risky way to ship software. I'm not sure why Google chose this:
- Speed? By intercepting outbound queries, it gives the local search engine a 'head start' to begin computing results before the HTML even comes back from Google's data centers and begins to be rendered by the browser.
- Perceived Security? Google has been famous for their backwards-compatibility with every little browser out there. They barely use Javascript and CSS, and then only when it's sure to be understood by the calling User-Agent. This way they can say "GDS doesn't require insecure browsers to run active scripts or plugins" -- though this seems fairly weak to me once you're asking users to install .DLLs instead.
- Trade Secrets? By using a necessarily proprietary interface between the public and local search engines, Google makes it that much harder to reuse GDS as a platform for someone else's applications. While you can spider http://127.0.0.1/ traffic and deconstruct the HTML, it's not as likely you'll get a Web Services API of equal richness out of it.
Overall, I'm left wondering what Google learned that I didn't. I'm willing to believe that they had a much better reason than my three bullets above for taking such a (IMHO) drastic step as burrowing into WinSock...
(As an aside, I tried in vain for nearly an hour to find an old horror story I'd most definitely heard back in '96 when I was working at W3C on PICS and the early hype around content-filters to keep smut off the web. There was a poor programmer who had the misfortune to have two commented lines of code that contained an expletive split across two lines -- even with a /* and */ between them. Whenever it was FTP'd up from his home machine, the compiler on the work side would fail, and it took quite some time to establish that a corporate filtering package was actually trapping all of the IP traffic out of his PC and rewriting it transparently. However, I'm wondering if I'll ever be able to substantiate this story, because I couldn't find it in the RISKS-digest, where I'd darn well expect it to have shown up eventually...)
