Google engineers have spent 1,000 man-years trying to understand the apple.
Intuitively, it's simple: An apple is the round thing hanging from the tree. But it's also the company that made the iPod, and the Beatles' record label, and the thing your neighbourhood farmer's market sells every Sunday.
When someone simply searches for "apple," how do you tell what it is they're looking for?
"When I built the first search system, it exploded in my face," says Amit Singhal, the man responsible for the code at the heart of Google's core search engine. "People were searching for apples and we were giving them the computer company."
The story of exactly what happens in the microseconds between when a user clicks "Search" and when Google presents its results is the tale of one of the most fiendishly difficult puzzles of the digital age: How do you sort the Internet?
But telling apples from Apples is more than just an intellectual challenge. It's also the key to Google's $160-billion (U.S.) empire. While the company has expanded into everything from smart phones to social networking, the heart of its business remains the provision of quick, accurate answers to people's search queries. Beside those answers, more often than not, Google will display ads related to the queries - ads from which Google derives the majority of its $24-billion-a-year revenue.
Earlier this month, the company announced a complete overhaul to the way it indexes the Web. Called "Caffeine," Google's new algorithm collects information about hundreds of thousands of web pages every second. The resulting database takes up a massive 100 million gigabytes of space.
There is an excellent reason why Google is willing to go to such lengths to improve its search engine: There is no loyalty in search. If a better search engine were to come along, users would flock to it, and Google's dominance would disappear.
For the first time since Google began eviscerating the competition in the search market a decade ago, a new crop of sites is threatening its position. Millions of users are turning to sites ranging from Twitter to Facebook, Amazon to eBay, to look for breaking news, restaurant recommendations or shoes for sale. Apple is starting to leverage its massive mobile applications ecosystem to run an advertising business aimed squarely at taking revenue away from Google's search advertising. Suddenly, Google's list of potential competitors has exploded.
The most high-profile of those competitors is Microsoft, which launched its Bing search engine last year. Google still dominates the search business - it controls about two-thirds of the key U.S. market - but Bing's initial growth, which saw it grab about 11 per cent of the market, caught Google off guard. To maintain its dominance, Google needs to remain a step ahead of its rivals when it comes to search technology.
A rare series of interviews with Google's top search staff reveals how difficult a job that is. From dealing with synonyms to understanding context to monitoring real-time data, Google's chief engineers have to grapple with the challenge of how to monitor every bit of information in the world.
What they have their sights trained on is the holy grail of search: Figuring out what you're looking for, even if you don't know what you're looking for. Meeting that challenge will cement Google's No. 1 position among search engines. Failing to meet it will open the door to rivals that want a piece of one of the planet's most lucrative businesses.
'No quick fixes'
Udi Manber looks satisfied - Google has just learned Maltese.
The Israeli-born former computer science professor is Google's vice-president in charge of core search functions. That includes the company's efforts to translate all the world's information - something it already does passably to proficiently in about 70 languages. The addition of Maltese to that arsenal affects only a tiny fraction of Google's users, but it's emblematic of the company's determination to leave no part of the global data stream unmapped.
Google ran about 6,000 search-related experiments last year, and made about 500 tweaks to its search engine. The tweaks ranged from fundamental changes to the way the engine determines the quality of a website, to improvements in the Hebrew spell-checking software. With each change came countless hours of debugging and testing.
"For a piece of software like a word processor, you can say this works or this doesn't work," says Scott Huffman, who runs Google's search quality testing division.
"In search, almost anything will hurt some queries and help some queries. Almost anything gets some wins."
At the heart of all these changes is one simple rule: no Band-Aid solutions.
"Nobody is allowed to change one query," Mr. Manber says. "They must change the algorithm."
That means if, for some reason, a search for "Microsoft" returns the General Motors website as the first result, engineers can't just go fix that one problem - they must find out what went wrong in the core search code, and fix that. Otherwise, the list of Band-Aid solutions would quickly become unwieldy thanks to the billions of English web pages out there - not to mention their Maltese translations.
The "no quick fixes" rule is in many ways a direct rejection of how Internet search began. More than a decade ago, when websites such as Yahoo ruled the search landscape, some search engines were entirely human-driven. Employees would scour the Web for high-quality information and index it manually. With just thousands of good websites online, the process didn't seem all that ludicrous.
Indeed, when Mr. Singhal began his graduate work some 20 years ago, the biggest problem researchers had was too little information. In order to test their search algorithms, Mr. Singhal and his colleagues would rely on bundles of documents on CD. At a time when researchers were testing their search algorithms on blocks of a few thousand documents, few people anticipated a world where the search area would consist of trillions of pieces of information.
Google's co-founders, Larry Page and Sergey Brin, did foresee the future - but only sort of.
Their original Google search engine, developed in the late 1990s, depended on "signals," which are basically clues as to how relevant a web page is to a search query. Clues can come in many forms, such as how often previous users searching for the same thing flocked to the same page, or whether the words in the query show up in the web page's title.
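One way to picture signals is as clues that are each scored and then added up. The sketch below is purely illustrative: the two signals mirror the examples above, while the function names, the weights and the sample pages are hypothetical and are not Google's code.

```python
def title_match(query, page):
    """Fraction of the query's words that appear in the page title."""
    words = query.lower().split()
    title_words = set(page["title"].lower().split())
    if not words:
        return 0.0
    return sum(w in title_words for w in words) / len(words)


def past_click_share(query, page):
    """Share of earlier searchers for this query who chose this page."""
    return page["click_share"].get(query.lower(), 0.0)


# Hypothetical weights; a real engine would tune these and use far more signals.
SIGNALS = [(title_match, 0.6), (past_click_share, 0.4)]


def score(query, page):
    """Weighted sum of every signal for one query/page pair."""
    return sum(weight * signal(query, page) for signal, weight in SIGNALS)


# Made-up pages and click histories, for illustration only.
pages = [
    {"title": "Apple pie recipes", "click_share": {"apple": 0.2}},
    {"title": "Apple - official site", "click_share": {"apple": 0.7}},
]
ranked = sorted(pages, key=lambda p: score("apple", p), reverse=True)
```

Both titles match the query equally well, so in this toy example the click history alone decides the order.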
Perhaps the most famous signal in the original algorithm is PageRank. Named after Mr. Page, the algorithm calculated the quality of a web page by measuring the number of sites linking to that page. Google refers to PageRank as a sort of search democracy, with each link a vote.
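The "each link is a vote" idea can be sketched in a few lines. What follows is a simplified, textbook-style version of the publicly described PageRank calculation, not Google's production code. The damping value of 0.85, the iteration count and the four-page sample web are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
```

In this sample, page c collects the most votes and ends up with the highest rank, while a ranks second because its single vote comes from a highly ranked page.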
The original system worked. It worked so well, in fact, that Google overtook rivals such as Yahoo for the title of Internet search king. But the algorithm had a fundamental flaw that was in some ways similar to the flaw in Yahoo's human-driven approach: It didn't envision how big and complex the Internet would get.
Mr. Singhal's rewrite of Google's core engine code, which replaced the founders' code in 2001, didn't do away with the concept of signals; it simply assumed that there were more signals out there that Google had yet to figure out. In effect, Google's engine now had the ability to make use of new signals that might come along in the future.
More than anything, it was this change that allowed Google to dominate search, as the Web changed from a few million static websites to billions upon billions of pages, blogs, videos and tweets.
The Re-Tweet Rate
The rise of social networking demonstrates why the ability to process multiple signals at lightning speed is important. "When we crawled documents, we had minutes if not hours to analyze them," Mr. Singhal says of the early days of search. "Suddenly, we moved into a world where we had two seconds."
The biggest problem is determining relevance. Services such as Twitter and Facebook present fleeting information, ranging from status updates to 140-character tweets. With hundreds of millions of people compulsively using such services, sorting out the noise becomes especially difficult.
To start solving that problem, Google turned to an old staple: PageRank.
Just as Mr. Page's algorithm determined the value of web pages by the number of pages linking to them, Google values Twitter users by the number of followers they have. The thinking is the same: People would not flock to a source of information if it wasn't any good.
But there's a glaring flaw in that system. Sites such as Twitter are most useful for their immediacy in breaking news situations - what if the person who just happens to be at the dock when the plane crashes into the Hudson River has only two or three followers?
Google handles these situations by including yet more signals into the algorithm - signals that, in some cases, didn't even exist a few years ago. The first is the re-tweet rate, or how quickly other users reproduce a piece of information - the faster something spreads, the more important it must be.
Google combines the re-tweet rate with information about the user's geography to create a sort of heat map. If a plane goes down in the Hudson River, and a piece of information originating from the Manhattan waterfront begins spreading at lightning speed, odds are it's something first-hand and worth noting.
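A rough sketch of how a re-tweet rate and a location could be combined into a heat map is shown below. Everything in it is an assumption made for illustration: the grid-cell approach, the field names and the sample posts are hypothetical, and nothing here describes Google's actual system.

```python
from collections import defaultdict


def retweets_per_minute(post):
    """Re-tweet rate: re-tweets divided by minutes since posting (at least one)."""
    minutes = max(post["minutes_since_posted"], 1.0)
    return post["retweets"] / minutes


def heat_map(posts, cell_degrees=0.1):
    """Add up re-tweet rates per map cell, keyed by rounded (lat, lon)."""
    heat = defaultdict(float)
    for post in posts:
        lat, lon = post["location"]
        cell = (round(lat / cell_degrees), round(lon / cell_degrees))
        heat[cell] += retweets_per_minute(post)
    return dict(heat)


# Made-up posts: two spreading fast near the Manhattan waterfront,
# one spreading slowly in Los Angeles.
posts = [
    {"location": (40.76, -74.00), "retweets": 300, "minutes_since_posted": 2},
    {"location": (40.77, -74.01), "retweets": 120, "minutes_since_posted": 3},
    {"location": (34.05, -118.24), "retweets": 50, "minutes_since_posted": 60},
]
heat = heat_map(posts)
hottest = max(heat, key=heat.get)
```

Note that follower counts play no part in this sketch, so a witness with only a handful of followers can still light up the map if the post spreads quickly.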
The importance of knowing a user's location helps to explain why Google has rolled out an aggressive mobile phone strategy, including an operating system for smart phones and even its own handset to compete with the BlackBerrys and iPhones of the world. While many assumed the company was simply trying to cash in on one of the fastest-growing retail sectors in the world, there was a second reason: Smart phones have the ability to pinpoint where a user is.
Imagine a user in downtown Vancouver searching for "farmers' market" - only now, Google knows exactly where that user is. The farmers' markets closest to the user in Vancouver show up much higher in the results page, and the search is much more useful for the user.
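The farmers' market example can be sketched as a simple re-ordering of results by distance from the user. The code below is a minimal illustration, assuming distance is the only factor. The market entries and coordinates are hypothetical, and a real engine would blend location with many other signals.

```python
from math import asin, cos, radians, sin, sqrt


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean radius of the Earth


def rank_by_distance(results, user_location):
    """Put the results closest to the user first."""
    return sorted(results, key=lambda r: distance_km(user_location, r["location"]))


user = (49.2827, -123.1207)  # approximate downtown Vancouver
# Hypothetical results with approximate coordinates.
results = [
    {"name": "Market in Toronto", "location": (43.65, -79.38)},
    {"name": "Market near downtown Vancouver", "location": (49.28, -123.11)},
    {"name": "Market in Surrey, B.C.", "location": (49.19, -122.85)},
]
ranked = rank_by_distance(results, user)
```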
The ability to collect geographic information provides Google with perhaps the single most significant improvement to search quality since Mr. Singhal rewrote the core engine some 10 years ago.
And it's one more step toward Google's ultimate goal - something Mr. Singhal describes as the holy grail of search.
Searches Gone Bad
There's a song by British electronica act Aphex Twin with a video that features a ridiculously long limousine.
If that's all you knew, how would you search for information about the song? You might simply enter what you know: "Music video with the long stretch limousine."
Sure enough, there it is: The second search result is the Wikipedia page for the Aphex Twin song, Windowlicker.
This is the goal that Google's search engineers are focusing much of their energy on - the ability to search for something a user can't name.
The challenge is always changing: If you line up all the queries Google gets in a day, and ignore duplicates, about one-third have never been seen before. To figure out what users are trying to find when they enter these never-before-seen queries, Google relies on "losses," or searches gone bad. Signs of a loss include a user who keeps coming back to Google with slightly modified versions of the same request, or someone who has to scroll through 10 pages of results before finding what they're looking for.
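The two signs of a loss described above can be turned into a simple check. This is only a sketch: the similarity threshold, the cut-off values and the session format are hypothetical, and the text-similarity measure from Python's standard library stands in for whatever Google actually uses.

```python
from difflib import SequenceMatcher


def is_reformulation(previous, current, threshold=0.6):
    """True when two consecutive queries are similar but not identical."""
    a, b = previous.lower().strip(), current.lower().strip()
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold


def looks_like_loss(session, max_reformulations=2, deep_page=10):
    """session: ordered list of (query, deepest results page viewed)."""
    reformulations = sum(
        is_reformulation(prev[0], cur[0])
        for prev, cur in zip(session, session[1:])
    )
    deepest = max((page for _, page in session), default=0)
    # Either repeated rewording or very deep scrolling counts as a loss.
    return reformulations >= max_reformulations or deepest >= deep_page


# A user rewording the same request twice.
rewording = [
    ("long limo music video", 1),
    ("long limousine music video", 1),
    ("long stretch limousine music video", 1),
]
# A user who had to dig through 10 pages of results.
deep_scroll = [("apple", 10)]
# A user who found the answer on the first page.
quick_win = [("apple", 1)]
```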
This is how Google tries to guess what you want, even if you don't know what you want. If thousands of people search for "shoes with toes" and eventually find their way to Vibram, the shoe company that designs footwear with individual pockets for each toe, the search engine notes the connection. In short, Google's search engine is moving from reactive to predictive. Just a few years ago, the notion of figuring out what a human being wanted even before that same human being figured it out would have been virtually impossible. Google hasn't yet built a computer that can think. But by indexing what billions of people search for, it has, with its search engine, built a near-passable facsimile.
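The "shoes with toes" example amounts to counting where searchers end up. A minimal sketch, assuming a log of query and final-destination pairs, follows. The log format and the placeholder site names are hypothetical.

```python
from collections import Counter, defaultdict


def learn_destinations(search_log):
    """search_log: iterable of (query, site the user finally settled on)."""
    destinations = defaultdict(Counter)
    for query, site in search_log:
        destinations[query.lower().strip()][site] += 1
    return destinations


def predict(destinations, query):
    """Most common destination seen for this query, or None if it is new."""
    counts = destinations.get(query.lower().strip())
    return counts.most_common(1)[0][0] if counts else None


# Made-up log entries with placeholder domains.
log = [
    ("shoes with toes", "vibram.example"),
    ("Shoes with toes", "vibram.example"),
    ("shoes with toes", "other-shoes.example"),
    ("music video long limousine", "wiki.example/windowlicker"),
]
learned = learn_destinations(log)
guess = predict(learned, "shoes with toes")
```

A query that has never been seen returns nothing in this sketch, which is exactly the case that makes the one-third of brand-new daily queries so hard.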
Yet it must get even better. It's not just Facebook and Twitter that represent the next generation of Google's competitors. This year, Apple also jumped into the advertising business, launching an ad platform that runs on the hundreds of thousands of apps for its iPhones and iPads.
"If people want to find out what restaurant to go to, they're not going to their search engine and typing in 'Japanese' and 'Palo Alto,'" Apple CEO Steve Jobs said at a recent conference. Instead, he contended, users are turning to specific apps. If that's the case, Google's dominance may become much less relevant to both users and advertisers.
Google knows its survival depends on its ability to perfect the science of reading your mind. Ben Gomes, the 50th person hired at Google and the man who's responsible for the look and feel of the company's website, says the company is clear that it must remain one step ahead of its competitors in the battle to deliver perfect search results.
"I've been here 10 and a half years," he says, "and search is still the heart of what we do."
Google still retains the lion's share of the search market in most parts of the world. But a new wave of potential competitors is starting to cut into that dominance - in large part, because the definition of "search" is becoming much broader. Here's a look at the competitors, and why they're a threat:
Facebook
Why: With 400 million users, Facebook has become home to huge stores of content, including photos and video. The site is also a hub for personal information that often can't be found anywhere on the Web.
Importance: High. Facebook's size and, perhaps more importantly, its meteoric growth, make it likely to soon eclipse Google as the world's most-visited site.
Twitter
Why: The microblogging site has become the go-to source for updates on breaking news, because often the first reports come from everyday people posting updates from their cellphones.
Importance: Medium. While Twitter's real-time model has forced companies such as Google and Facebook to play catch-up, the site's usefulness as a search engine is related mostly to breaking news.
Bing
Why: Microsoft's search engine represents the first real shot in years by a major company to challenge Google's core business. Bing caught Google executives off guard, with its focus on consumer services and user experience.
Importance: Medium. Bing still controls a small share of the overall search market. Yahoo has outsourced its search engine duties to Microsoft, and the two entities combined may still pose a threat.
Amazon
Why: Increasingly, users are turning to search engines to find things to buy. Amazon's huge store of product information makes it a prime spot for consumers. Unlike Google, Amazon also lets users buy the products they search for directly from the same source.
Importance: Low. Even though Amazon may steal some users from Google, anyone who isn't looking to buy something is unlikely to use Amazon as a search engine.