I want to scrape Google as fast as possible. In the browser, Google loads progressively: the time to load the results is split into Waiting (TTFB) and Content Download.


I am using Python requests to scrape Google. The request time consists almost entirely of Waiting (TTFB), with only a little Content Download.


Code is as below:

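# Build the search URL; escaped_search_term, size and offset come from the caller.
google_url = 'https://www.google.com/search?q={}&num={}&start={}'.format(
    escaped_search_term, size, offset)
# Note: verify=False disables TLS certificate verification.
response = self.session.get(google_url, verify=False,
                            headers={'User-Agent': self.USER_AGENT})
return response.text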

I want to load the results as fast as possible, so I think I have to download the Google results progressively, like the browser does. How can I do this?

Vadim Kotov
hamid
  • I think you would need to use Google Custom Search API, per this question: https://stackoverflow.com/questions/4082966/what-are-the-alternatives-now-that-the-google-web-search-api-has-been-deprecated . It will potentially cost money. If you don't use it, it could violate Google's TOS. – alex Sep 11 '19 at 15:12
  • @alex thanks for the reply. My problem is that getting the results is so slow. I think the problem would remain even if I used the Google Custom Search API. – hamid Sep 11 '19 at 16:47
  • Do you mean that their API is slow or your code execution is slow? – alex Sep 11 '19 at 16:55
  • @alex I guess my code execution is slow. I want to download the web page progressively, like `stream` when downloading a file, but I do not know how. – hamid Sep 12 '19 at 14:41
  • If the goal is only to download the HTML in a streaming fashion, adapt the code from [this answer](https://stackoverflow.com/a/39217788/1291371). 1. The total download time is only 30 ms less for the request from the browser. 2. Does Python have a streaming HTML parser? If you're going to write the response stream to a file (in memory or on disk), read the contents of the file, and parse it with `bs4`, then I don't see a performance benefit in doing this. – ilyazub Jan 25 '21 at 10:07
  • @IlyaZub thanks for the response. When I set `stream` to True, the response takes more time to download and the whole page arrives in the first callback, so it does not help. Anyway, I realized that Google does not send search results progressively. Google sends the page header first, then pauses for a while, then sends all the results together in at most 3 packets. So downloading the page progressively does not help. – hamid Jan 25 '21 at 10:31
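
The streaming download discussed in these comments could be sketched roughly as follows, assuming the `requests` library. `stream_page` is a hypothetical helper name, and the chunk size and timeout are arbitrary choices. It only reports when each chunk arrives; if the server sends the whole body in one burst, as described in the last comment, the timestamps will be nearly identical and streaming gives no speedup.

```python
import time


def stream_page(session, url, headers=None, chunk_size=8192):
    """Yield (seconds_since_request_start, chunk) for each body chunk.

    `session` is assumed to be a requests.Session (or any object whose
    .get() accepts stream=True and returns a compatible response).
    """
    start = time.perf_counter()
    # stream=True makes .get() return once the headers have arrived;
    # the body is then read lazily through iter_content().
    with session.get(url, headers=headers, stream=True, timeout=10) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                yield time.perf_counter() - start, chunk
```

It might be used as `for elapsed, chunk in stream_page(requests.Session(), google_url, headers={'User-Agent': USER_AGENT}): print(round(elapsed, 3), len(chunk))`, where `google_url` and `USER_AGENT` stand in for the values from the question.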

0 Answers