
newbie programmer and lurker here, hoping for some sensible advice. :)

Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code:

import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'   # my Appid (omitted here)
query = '...'   # my query (omitted here)

# Fetch the XML results from the Bing API and parse them
soup = BeautifulStoneSoup(urllib2.urlopen("http://api.search.live.net/xml.aspx?Appid=" + Appid + "&query=" + query + "&sources=web"))
totalResults = soup.find('web:total').text
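
One caveat worth noting: since the query is concatenated straight into the URL, it should be URL-encoded if it contains spaces or other special characters. A minimal sketch, assuming Python 2's standard urllib (the search term shown is just a placeholder):

import urllib

# "my search term" becomes "my+search+term", safe to concatenate into the URL
query = urllib.quote_plus("my search term")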

So I'd like to do this across a few thousand search terms and was wondering:

  1. whether doing this request a few thousand times would be construed as hammering the server,
  2. what steps I should take to avoid hammering said servers (what are best practices?), and
  3. whether there's a cheaper (data-wise) way to do this using any of the major search engine APIs.

It just seems unnecessarily expensive to grab all that data just to extract one number per keyword, and I was wondering if I've missed anything.

FWIW, I did some homework and tried the Google Search API (deprecated) and Yahoo's BOSS API (soon to be deprecated and replaced with a paid service) before settling on the Bing API. I understand that scraping a search results page directly is considered poor form, so I'll pass on scraping search engines directly.

– binarysolo

2 Answers


There are three approaches I can think of that have helped me previously when I had to do large-scale URL resolution.

  1. HTTP Pipelining (another snippet here)
  2. Rate-limiting requests per IP (i.e., each IP can only issue 3 requests/second); a minimal sketch of this approach follows the list. Some suggestions can be found here: How to limit rate of requests to web services in Python?
  3. Issuing requests through an internal proxy service, using http_proxy to redirect all requests to said service. This proxy service will then iterate over a set of network interfaces and issue rate-limited requests. You can use Twisted for that.
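
For example, a minimal sketch of approach 2 (the 3-requests-per-second budget and the rate_limited_fetch helper name are my own placeholders, not part of any particular library):

import time
import urllib2

MIN_INTERVAL = 1.0 / 3    # at most 3 requests per second per IP (assumed budget)
_last_request = [0.0]     # timestamp of the previous request

def rate_limited_fetch(url):
    # Sleep just long enough to keep the request rate under the budget
    wait = MIN_INTERVAL - (time.time() - _last_request[0])
    if wait > 0:
        time.sleep(wait)
    _last_request[0] = time.time()
    return urllib2.urlopen(url).read()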
– Mahmoud Abdelkader
  • Thanks, this is a lot more sophisticated of an answer than I needed but I really appreciate the help. :-) Would be useful in the future if I wanted to do something cool. – binarysolo Mar 10 '11 at 20:34

With regard to your question 1, Bing has an API Basics PDF file that summarizes the terms and conditions in human-readable form. The "What you must do" section includes the following statement:

Restrict your usage to less than 7 queries per second (QPS) per IP address. You may be permitted to exceed this limit under some conditions, but this must be approved through discussion with api_tou@microsoft.com.

If this is just a one-off script, you don't need to do anything more complex than adding a sleep between requests, so that you're only making a couple of requests a second. If the situation is more complex, e.g. these requests are being made as part of a web service, the suggestions in Mahmoud Abdelkader's answer should help you.
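
For instance, a minimal sketch reusing the snippet from your question (the terms list and the one-second pause are placeholder assumptions, not requirements from Bing):

import time
import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'                      # your Appid (omitted here)
terms = ['term one', 'term two']   # placeholder list of search terms

counts = {}
for term in terms:
    url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
           "&query=" + urllib.quote_plus(term) + "&sources=web")
    soup = BeautifulStoneSoup(urllib2.urlopen(url))
    counts[term] = soup.find('web:total').text
    time.sleep(1)                  # roughly one request per second, well under the 7 QPS limit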

– Mark Longair
  • Thanks, this is all that I needed (one-off request for research). :-) Appreciate both the answers you guys gave! – binarysolo Mar 10 '11 at 20:33