4

I have a set of roughly 10 million search queries. The goal is to collect the number of hits a search engine returns for each of them. For example, Google returns about 47,500,000 hits for the query "stackoverflow".

The problems are:

1- The Google API is limited to 100 queries per day. That is nowhere near enough for my task, since I need counts for millions of queries.

2- I used the Bing API, but it does not return an accurate number (accurate in the sense of matching the hit count shown in the Bing UI). Has anyone come across this issue before? A sketch of the kind of call I mean is at the end of the question.

3- Issuing the queries directly to a search engine and parsing the HTML is one option, but it quickly triggers CAPTCHAs and does not scale to this number of queries.

All I care about is the number of hits, and I am open to any suggestion.
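
For reference, here is a minimal sketch of the kind of API call I mean. The Bing Web Search v7 endpoint, the placeholder subscription key, and the `totalEstimatedMatches` field are assumptions used for illustration, not an exact copy of my setup:

```python
import requests

# Minimal sketch, assuming the Bing Web Search v7 REST endpoint and its
# JSON response shape; BING_KEY is a placeholder subscription key.
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"
BING_KEY = "YOUR_SUBSCRIPTION_KEY"

def estimated_hits(query: str) -> int:
    """Return the engine's estimated total hit count for a query."""
    resp = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": BING_KEY},
        params={"q": query, "count": 1},  # only the estimate is needed, not the results
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # The estimate reported here often differs from the number shown in the UI.
    return data.get("webPages", {}).get("totalEstimatedMatches", 0)

if __name__ == "__main__":
    print(estimated_hits("stackoverflow"))
```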

Joe O
  • I am also interested in this... just so you know, large search engines won't always return the same results, because of sharding across their servers. In other words, run that same Google search in an hour, when there is different traffic, and you could get a significantly different number because you hit a different server that wasn't as up to date. Also check out the paper [Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL](http://www.cs.washington.edu/education/courses/cse573/04au/papers/0212033.pdf) – hackartist Feb 07 '12 at 19:13

2 Answers

4

Well, I was really hoping that someone would answer this, since it's something I was also interested in finding out, but since it doesn't look like anyone will, I'll throw in these suggestions.

You could set up a series of proxies that change their IP every 100 requests, so that you can query Google as seemingly different users (which sounds like a lot of work). Or you could download Wikipedia and write something to parse the data, so that when you search for a term you can see how many pages it appears in; of course that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages your search terms occur in. A combination of these methods might boost the accuracy for any given search term.
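
A rough sketch of the Wikipedia-counting idea, assuming a local bz2-compressed XML dump; the dump path and the naive word matching are placeholders, not a recommendation of exact tooling:

```python
import bz2
import re

# Hedged sketch: count how many pages in a local Wikipedia XML dump contain
# each query term. Assumes a bz2-compressed dump at DUMP_PATH and treats one
# <page>...</page> block as a document; the case-insensitive word match is
# only a crude stand-in for a search engine's notion of a "hit".
DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"

def page_counts(terms):
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms}
    counts = dict.fromkeys(terms, 0)
    page_lines = []
    with bz2.open(DUMP_PATH, "rt", encoding="utf-8", errors="ignore") as dump:
        for line in dump:
            page_lines.append(line)
            if "</page>" in line:
                page_text = "".join(page_lines)
                page_lines = []
                for term, pat in patterns.items():
                    if pat.search(page_text):
                        counts[term] += 1
    return counts

if __name__ == "__main__":
    print(page_counts(["stackoverflow", "search engine"]))
```

For millions of terms you would want to build an inverted index rather than scanning the dump per batch of terms, but the counting idea is the same.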

Certainly none of these methods are as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.

hackartist
  • Thanks hackartist for your answer. I don't have previous experience with setting up a series of proxies and orchestrating the traffic, so I would rather leave that as one of the last options. Wikipedia is not a representative dataset for my task; I have tried it and it was not useful. I am using the Google n-grams data now and would also like to use the Microsoft dataset, which provides access to title, body, and anchor-text statistics. The problem with the Microsoft data, though, is that it only returns probabilities and not plain counts. Thanks again. – Joe O Feb 09 '12 at 16:59
  • What type of project are you trying to use this data for, i.e. what is the right kind of source text? Don't forget Twitter and the blogosphere if you are looking for things people are currently talking about. (Also, on Stack Overflow, when you find an answer helpful please vote it up or accept it, since that adds to the answerer's reputation, which they can then use to get other people to answer their questions.) Best of luck – hackartist Feb 09 '12 at 18:21
1

I see this is a very old question, but I was trying to do the same thing, which is what brought me here. I'll add some info and my progress to date:

Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.

The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this, go to page 2 of the results and then modify the 'first' parameter in the URL to a much higher value. Doing this may let you find the end of the result set (this worked for me last year, I'm fairly sure, although today it only worked up to the first few thousand results). Even if it doesn't let you reach the end of the result set, you will see that the estimate improves as the query engine considers more hits.
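
A rough sketch of automating that refinement: the 'first' offset parameter matches what is described above, but the User-Agent, the regex over the displayed "results" text, and Bing's tolerance for scripted requests are all assumptions that may well break (and heavy automated scraping is likely against the terms of service):

```python
import re
import time
import requests

def refined_estimate(query, max_offset=2000, step=200):
    """Walk deeper into Bing's result pages by raising the 'first' offset
    and keep the last estimate that could be parsed from the page."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed; Bing may still block scripts
    estimate = None
    for first in range(1, max_offset, step):
        html = requests.get(
            "https://www.bing.com/search",
            params={"q": query, "first": first},
            headers=headers,
            timeout=10,
        ).text
        # Assumed markup: an "About 148,000 results"-style string in the page.
        m = re.search(r"([\d.,]+)\s+results", html)
        if m:
            estimate = int(re.sub(r"[.,]", "", m.group(1)))
        time.sleep(2)  # be polite; aggressive crawling triggers CAPTCHAs
    return estimate  # None if no count could be parsed

if __name__ == "__main__":
    print(refined_estimate("stackoverflow"))
```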

I found Bing slightly easier to use in this way, but I was still unable to get an exact count for the site I was looking at. Google seems to be actively preventing this use of their engine, which isn't that surprising. Bing also seems to hit limits, although those looked more like defects.

For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.

If you want to do this for your own site, you can use the search engine's webmaster tools to view the indexed page count. For other sites I think you'd need to use the search engine's API (at some cost).

Jesse Bugden