
I want to be able to set the number of pages that my scraper crawls on Google.

I'm confused about where to start. The scraper doesn't open multiple pages at once; it requests them one at a time.


import requests
import re

# Search terms that get joined into a single Google query
keywords = ["site:facebook.com", "@gmail.com", "sports"]

url = 'https://google.com/search?q={}'.format('+'.join(keywords))
print(url)

response = requests.get(url)

# Matches anything that looks like an email address
regex = r"[\w._-]+@[\w._-]+\.[\w._-]+"

emails = re.findall(regex, response.text)

# Deduplicate the matches
emails_list = list(set(emails))

print(emails_list)

It works fine when scraping the first page.


1 Answer


FYI, Google will block you if you scrape them. This is probably fine for a small number of requests, but if it looks like you're a bot, you'll probably get IP blocked. Maybe consider something like this to use as a proxy.

In any case, how to actually do this:

You can paginate Google responses by passing the start GET parameter.

So for example, if you had a request like:

https://www.google.com/search?q=test

second page would be:

https://www.google.com/search?q=test&start=10

third page would be:

https://www.google.com/search?q=test&start=20
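
In general, Google serves 10 results per page, so page n corresponds to start=(n - 1) * 10. As a quick sketch (the page variable here is just illustrative):

page = 3                         # 1-indexed results page you want
start = (page - 1) * 10          # Google returns 10 results per page by default
print("https://www.google.com/search?q=test&start={}".format(start))
# https://www.google.com/search?q=test&start=20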

You can use urlencode to create the URL:

>>> from urllib.parse import urlencode
>>> "https://www.google.com/search?{}".format(urlencode({'q': " ".join(["site:facebook.com", "@gmail.com", "sports"]), "start": 10}))
'https://www.google.com/search?q=site%3Afacebook.com+%40gmail.com+sports&start=10'
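
Putting that together with your scraper, a rough sketch that requests the first few pages one at a time might look like this (page_count and the 10-second sleep are just placeholder values to tune, and Google may still block you):

import re
import time
import requests
from urllib.parse import urlencode

keywords = ["site:facebook.com", "@gmail.com", "sports"]
query = " ".join(keywords)
page_count = 3  # how many result pages to crawl
regex = r"[\w._-]+@[\w._-]+\.[\w._-]+"

emails = set()
for page in range(page_count):
    # start=0 is the first page, start=10 the second, and so on
    params = urlencode({"q": query, "start": page * 10})
    url = "https://www.google.com/search?{}".format(params)
    response = requests.get(url)
    emails.update(re.findall(regex, response.text))
    time.sleep(10)  # be polite between requests; see the comments below about randomizing this

print(list(emails))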
  • Do you know how many requests they will allow before blocking the IP address? – Josh Jones Aug 24 '19 at 20:06
  • @JoshJones There's no set number. You should definitely `time.sleep` (wait) for a few seconds (maybe 10?) between requests. But that also brings up another issue. If it sees an IP making the same request but increasing the page number one at a time, every 10 seconds, it can detect that behavior and know that's a bot. It does more analysis than just requests/min. – Sean Breckenridge Aug 24 '19 at 21:27
  • @JoshJones You can look at [this package](https://gitlab.com/hyperion-gray/googlespider) for some more context; it recommends waiting at least 30 seconds between requests. You can use [`random.randint`](https://docs.python.org/3.7/library/random.html#random.randint) to choose something between 30 and 50, for example, so it's harder for them to detect you... but you're still risking it. The best way to do this is a proxy; there are lots of them online, but they are typically paid. The one I listed above does come with 1000 free API requests. – Sean Breckenridge Aug 24 '19 at 21:29
  • If it wasn't clear, choosing to sleep for some time between 30 and 50 seconds would be done like: `time.sleep(random.randint(30, 50))` – Sean Breckenridge Aug 24 '19 at 21:35
  • Thanks. I'm going to try that with a proxy changer. I don't need that many requests, but enough to build a solid database. – Josh Jones Aug 25 '19 at 15:17