18

I understand the basics of search engine ranking, including the ideas of "inverted index", "vector space model", "cosine similarity", "PageRank", etc.

However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!

How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?

Thanks!

UPDATE:

  1. Most of the responses (including some older discussions) so far seem to attribute the credit to the "inverted index". However, as far as I know, an inverted index only helps find the "relevant pages". In other words, via the inverted index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, it is still not clear how to "rank" these millions of "relevant pages", based on the threads I have read so far (see the toy sketch after this list).
  2. The MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet this requirement.
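To make point 1 concrete, here is a toy sketch in Python (purely my own illustration, not how any real engine stores its index) of what the lookup step gives you; how those candidates then get ranked so quickly is exactly what I'm asking about:

    # Toy inverted index: term -> sorted list of document IDs (illustrative only).
    index = {
        "barack": [1, 4, 7, 9],
        "obama":  [1, 2, 4, 9, 12],
    }

    def candidates(terms, index):
        """Return the doc IDs containing every query term (set intersection)."""
        postings = [set(index.get(t, [])) for t in terms]
        return set.intersection(*postings) if postings else set()

    print(candidates(["barack", "obama"], index))  # {1, 4, 9} -- still unranked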
user1036719
  • 1,036
  • 3
  • 15
  • 32
  • It doesn't, the sorting's already done. That's the whole point of the page rank algo. – blgt Oct 03 '13 at 14:40
  • This thread covers that question pretty well: http://stackoverflow.com/questions/1298860/how-does-google-serve-results-so-fast – Andrew Hill Oct 03 '13 at 14:40
  • 1
    @blgt: I understand PageRank sorts pages offline. However, Google still needs to calculate the "relevance score" online. (Even with an inverted index, a search engine still needs to calculate the relevance scores for the pages containing the query term, doesn't it?) – user1036719 Oct 03 '13 at 14:42
  • Massive parallelism. It's not as if a parallel task processing some part of the query needs access to the entire database, so this parallelizes very, very well. All that is needed is that result rankings are comparable globally. – hyde Nov 06 '13 at 07:41
  • It's not a technical requirement for MapReduce to take a minute. If the tasks are easy enough and massively distributed, why shouldn't it run in sub-second? – Cephalopod Nov 06 '13 at 14:35
  • Ranking and sorting are not the same thing. Pages can have independent 'ranks' (like an absolute measure of incoming links). In that sense, you could pool the objects with the highest ranks and treat them as a single cluster; at that point you're not sorting 937 million records, you're sorting a high-ranking cluster as 'first' overall, and thereby completely ignoring 90% of the matches. You can then 'sort' within that smaller cluster and only present those results. You'd have to cluster around the primary sort field like that and pre-sort it offline. – Triynko Jun 06 '18 at 22:19
  • 1
    Google tells you when you go to a higher page number in the results: "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 2156782.)" In other words... they CAN'T. It's all smoke and mirrors. – Triynko Jun 06 '18 at 23:33

10 Answers

9

The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.

Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided those pages were included later.

This gives the developers a degree of latitude entirely unavailable in almost all other domains.

The real question to ask is - how precisely do the results match the actual rank assigned to each page?

Pekka
  • 3,529
  • 27
  • 45
6

One possible strategy is to rank just the top k instead of the entire list.

For example, to find the top 100 results out of 1 million hits, a heap-based selection has time complexity O(n log k). Since k = 100 and n = 1,000,000, log(k) is a small constant that can be ignored in practice.

So you effectively only need O(n) time to obtain the top 100 results out of 1 million hits.
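A minimal sketch of this in Python, assuming each hit already carries a relevance score (the scores below are made-up placeholders):

    import heapq
    import random

    # Pretend these are the relevance scores of 1,000,000 matching pages.
    scores = [(doc_id, random.random()) for doc_id in range(1_000_000)]

    # heapq.nlargest keeps a heap of size k, so this runs in O(n log k) time --
    # effectively linear when k is tiny compared to n.
    top_100 = heapq.nlargest(100, scores, key=lambda pair: pair[1])
    print(top_100[:3])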

Xiao Xuan
  • 76
  • 2
6

There are two major factors that influence the time it takes for you to get a response from your search engine.

The first is whether you're storing your index on a hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.

The other is having a cache for your popular queries. It takes a lot longer to evaluate a query than it does to return results from a cache. The random access time of a disk is too slow for this, so the cache needs to be stored in RAM.

To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
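As a toy sketch of that caching idea (a plain in-process dict here; an actual deployment would use a distributed cache such as memcached, and the TTL is an arbitrary choice):

    import time

    CACHE_TTL_SECONDS = 60          # how long a cached result page stays "fresh enough"
    _cache = {}                     # query string -> (timestamp, results)

    def search(query):
        """Placeholder for the expensive ranking pipeline."""
        time.sleep(0.5)             # simulate real work
        return ["result %d for %r" % (i, query) for i in range(10)]

    def cached_search(query):
        now = time.time()
        hit = _cache.get(query)
        if hit and now - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]           # slightly stale, but instant
        results = search(query)
        _cache[query] = (now, results)
        return results

    cached_search("barack obama")   # slow: computed and cached
    cached_search("barack obama")   # fast: served from the cache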

Another issue that affects search engine latency is network overhead. Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start returning results for a query.

The moment a query hits their servers, the server immediately sends the HTTP response headers back to the user, even before Google has finished processing the query terms.

I'm sure they have a bunch of other tricks up their sleeves, too.

EDIT: They also keep their inverted lists sorted already, from the indexing process (it's better to do that work once at indexing time than on every query).

With these pre-sorted lists, the most expensive operation is list intersection. That said, I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much of a factor for them.
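For illustration, here's a minimal sketch of intersecting two already-sorted postings lists with a two-pointer merge (linear in the combined length of the lists; the doc IDs are made up):

    def intersect(postings_a, postings_b):
        """Intersect two sorted lists of doc IDs in O(len(a) + len(b))."""
        i = j = 0
        out = []
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                out.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1
            else:
                j += 1
        return out

    print(intersect([1, 4, 7, 9], [1, 2, 4, 9, 12]))  # [1, 4, 9]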

The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process a document at a time instead of a term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the query terms it contains (skipping documents that contain none of the query terms).
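As a rough illustration of the kind of scoring involved, here is one common BM25 variant (the k1 and b values are the usual textbook defaults, nothing engine-specific):

    import math

    def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
                   k1=1.2, b=0.75):
        """Score one document against a query, document-at-a-time."""
        score = 0.0
        for term in query_terms:
            tf = doc_tf.get(term, 0)
            if tf == 0:
                continue                      # this document lacks the term
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
            score += idf * norm
        return score

    # Example: a 120-word document in a 1,000,000-document collection.
    print(bm25_score(["barack", "obama"], {"barack": 3, "obama": 5},
                     doc_len=120, avg_doc_len=300,
                     df={"barack": 20_000, "obama": 50_000}, n_docs=1_000_000))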

But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined whether it's likely that a document will end up within our top k results.
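A deliberately simplified sketch of that rare-term-first idea (real engines use proper score bounds, e.g. MaxScore or WAND, rather than simply dropping documents that miss the rarest term, so treat this as a caricature):

    import heapq

    def top_k_rare_term_first(query_terms, postings, doc_freq, score_term, k=10):
        """Score only documents containing the rarest query term."""
        terms = sorted(query_terms, key=lambda t: doc_freq.get(t, float("inf")))
        candidates = postings.get(terms[0], [])      # smallest postings list
        scored = ((sum(score_term(t, doc) for t in terms), doc) for doc in candidates)
        return heapq.nlargest(k, scored)

    # Toy usage with a dummy per-term score function.
    postings = {"obama": [1, 2, 4, 9, 12], "barack": [1, 4, 7, 9]}
    doc_freq = {"obama": 5, "barack": 4}
    print(top_k_rare_term_first(["barack", "obama"], postings, doc_freq,
                                score_term=lambda t, d: 1.0, k=3))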

bdean20
  • 822
  • 8
  • 14
1

Also, I guess the use of NoSQL databases instead of an RDBMS helps.

NoSQL databases scale horizontally better and don't create bottlenecks. Big players like Google, Facebook or Twitter use them.

As other comments/answers suggested, the data might already be sorted, and they return offsets into the found data instead of the whole batch.

The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time xD

aleation
  • 4,796
  • 1
  • 21
  • 35
1

As Xiao said, just rank the top-k instead of the entire list.

Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)

Mau
  • 14,234
  • 2
  • 31
  • 52
  • 1
    How do you determine what the "top-k" is until you've ranked? If someone searches for something and it matches 937,000,000 records, how do you decide which of those are the "top". You have to rank them first. – Triynko Jun 06 '18 at 22:03
0

Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm

0

This is my theory... It's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, that link gets a vote-up in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts building the hash for that keyword, so you end up with few or no results at all. As you might know, the page ranking also depends on many other factors, like backlinks and the number of pages referring to a keyword in the search, etc.
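A toy rendering of this hypothesis (purely illustrative, not how any real engine works): each query keeps a tally of clicked links, and candidates are ordered by those votes.

    from collections import defaultdict, Counter

    click_votes = defaultdict(Counter)   # query -> tally of clicked URLs

    def record_click(query, url):
        click_votes[query][url] += 1     # the "vote-up" on click

    def order_results(query, candidate_urls):
        # Order candidates by how often earlier users clicked them for this query.
        return sorted(candidate_urls, key=lambda u: -click_votes[query][u])

    record_click("barack obama", "https://en.wikipedia.org/wiki/Barack_Obama")
    print(order_results("barack obama",
                        ["https://example.com",
                         "https://en.wikipedia.org/wiki/Barack_Obama"]))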

HungryFoolish
  • 542
  • 1
  • 9
  • 27
0

Regarding your update:

The MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet this requirement.

MapReduce is not designed only for batch tasks. There are quite a few MapReduce-style frameworks that support real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.

Back to your question: MapReduce is the key to distributing the query task across multiple nodes and then merging the results together.
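A minimal scatter/gather sketch of that idea (in-process threads stand in for remote nodes, and the shard contents are invented):

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    # Each "node" holds one shard of the index and returns its local top-k as
    # (score, doc_id) pairs. The shard contents are invented for illustration.
    shards = [
        [(0.91, "doc1"), (0.55, "doc7")],
        [(0.87, "doc3"), (0.42, "doc9")],
        [(0.95, "doc5"), (0.60, "doc2")],
    ]

    def query_shard(shard, k):
        return heapq.nlargest(k, shard)              # "map": local top-k on each node

    def distributed_top_k(k):
        with ThreadPoolExecutor() as pool:           # scatter the query to all nodes
            partials = list(pool.map(lambda s: query_shard(s, k), shards))
        merged = [hit for part in partials for hit in part]
        return heapq.nlargest(k, merged)             # "reduce": merge partial results

    print(distributed_top_k(3))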

Changgeng
  • 578
  • 2
  • 8
0

There's no way you can expect to get an accurate answer to this question here ;) Anyway, here are a couple of things to consider - Google uses a unique infrastructure in every part of it. We cannot even guess at the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.

Now, for the software implementation - as the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume the ranking is done on a totally independent part of the infrastructure every hour. And we already know that Google's crawler bots are roaming the Web 24/7, so I assume that new pages are added into an "unsorted" hash map and then ranked on the next run of the algorithm.

Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that holds PageRank values from 0-9.99, another queries the part from 10-19.99, and so on. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware.

Then, when they need to compile the results of your query, since they know exactly which machine ranks which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from the queries run on all those machines against the different parts of the database only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if that yields at least n results they're done, if not they move on to the next machine. This takes only O(s*g/r), where s is the number of pages Google serves per query, g is the gapping factor and r is the highest PageRank value. This assumption is supported by the fact that when you move to the second page of results, your query is run once again (notice the different time it takes to generate).
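A toy rendering of this 'fill the pool' hypothesis (entirely my own illustration; the rank buckets and their contents are invented, each standing in for one machine's slice of the PageRank range):

    # One bucket per machine, ordered from the highest PageRank range downwards
    # (a "gapping factor" of 10 here); contents are invented for illustration.
    buckets = [
        [("page_a", 97.0), ("page_b", 92.3)],    # PageRank 90-100
        [("page_c", 88.1)],                      # PageRank 80-90
        [("page_d", 75.5), ("page_e", 71.2)],    # PageRank 70-80
    ]

    def first_results_page(n):
        """Fill the result pool from the highest-ranked bucket downwards."""
        pool = []
        for bucket in buckets:
            pool.extend(bucket)
            if len(pool) >= n:                   # enough to fill one results page
                break
        return pool[:n]

    print(first_results_page(3))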

This is just my two cents, but I think I'm pretty accurate with this hypothesis.

EDIT: You might want to check this out for complexity of high-order queries.

nstosic
  • 2,584
  • 1
  • 17
  • 21
0

I don't know what Google really does, but they surely use approximation. For example, if the search query is 'search engine', then the reported number of results will be = (number of documents with at least one occurrence of the word 'search') + (number of documents with at least one occurrence of the word 'engine'). This can be done with O(1) time complexity. For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html
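A sketch of that kind of constant-time estimate, assuming per-term document frequencies were precomputed at indexing time (the numbers are made up):

    # Per-term document frequencies computed once at indexing time (made-up numbers).
    doc_freq = {"search": 120_000_000, "engine": 45_000_000}

    def approx_result_count(query_terms):
        # Summing the per-term frequencies is O(1) per term; it overcounts
        # documents that contain more than one of the terms.
        return sum(doc_freq.get(t, 0) for t in query_terms)

    print(approx_result_count(["search", "engine"]))  # 165000000 (approximate)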