I want to make a Google News scraper with Python and BeautifulSoup, but I have read that there is a chance I can be banned.

I have also read that I can prevent this by using rotating proxies and rotating IP addresses. The only thing I have managed to do is rotate the User-Agent. Can you show me how to add a rotating proxy and rotating IP address?

I know that it should be added in the requests.get() part, but I do not know how.

This is my code:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'

for page in range(1, 5):

    start = page * 10  # Google paginates news results 10 at a time

    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, start)
    print(url)

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    headline_text = soup.find_all('h3', class_= "r dO0Ag")

    snippet_text = soup.find_all('div', class_='st')

    news_date = soup.find_all('div', class_='slp')

    print(len(news_date))
taga
  • Does this answer your question? [Is it ok to scrape data from Google results?](https://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results) – Ramon Medeiros Nov 26 '19 at 11:02
  • No, my question is how to set some parameters to prevent being banned. – taga Nov 26 '19 at 11:18

5 Answers


You can do searches with the proper API from Google:

https://developers.google.com/custom-search/v1/overview
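For example, a minimal sketch of calling the Custom Search JSON API with requests (the API key and search engine id below are placeholders you would create in the Google Cloud console; error handling is kept to a minimum):

```python
import requests

API_KEY = "YOUR_API_KEY"        # placeholder: create one in the Google Cloud console
CX = "YOUR_SEARCH_ENGINE_ID"    # placeholder: id of your Programmable Search Engine

def build_params(query, start=1):
    """Assemble the query parameters for the Custom Search JSON API."""
    return {"key": API_KEY, "cx": CX, "q": query, "start": start}

def search(query, start=1):
    resp = requests.get("https://www.googleapis.com/customsearch/v1",
                        params=build_params(query, start))
    resp.raise_for_status()  # raise on quota or auth errors
    return resp.json()       # results are under the "items" key
```

Each result in the returned `items` list carries `title`, `link` and `snippet` fields, which roughly correspond to the headline, URL and snippet the scraper above is trying to extract.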

Ramon Medeiros

You can use https://gimmmeproxy.com for rotating proxies, together with its Python wrapper: https://github.com/DeyaaMuhammad/GimmeProxyApi.

import requests
# GimmeProxyAPI is provided by the wrapper linked above

proxy = GimmeProxyAPI(protocol="https")

proxies = {
  'http': proxy,
  'https': proxy
}

requests.get('https://example.org', proxies=proxies)
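To actually rotate, you can ask for a fresh proxy on every attempt. Below is a sketch under that assumption; `get_proxy()` is a stand-in for the `GimmeProxyAPI(protocol="https")` call above (so the test address is a documentation-only example), and the retry count of 3 is an arbitrary choice:

```python
import requests

def get_proxy():
    # Stand-in for GimmeProxyAPI(protocol="https") from the wrapper above;
    # 203.0.113.x is a documentation-only address range (RFC 5737).
    return "https://203.0.113.10:8080"

def make_proxies(proxy):
    """Build the proxies mapping that requests expects."""
    return {"http": proxy, "https": proxy}

def fetch(url, retries=3):
    """Try the request through a fresh proxy, moving on when one fails."""
    for _ in range(retries):
        try:
            return requests.get(url, proxies=make_proxies(get_proxy()), timeout=10)
        except requests.RequestException:
            continue  # dead or slow proxy; fetch another and retry
    raise RuntimeError("all proxies failed")
```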
Andrey E

If you want to learn web scraping, it is best to choose another website, like Reddit or some online magazine. Google News (and other Google services) is well protected against scraping, and Google changes the class names often enough to prevent you from doing it the easy way.

Lina Yemely
  • I have watched some tutorials and the classes have been the same for 2-3 years. I'm not saying that they do not change them, I'm just saying that some things stay the same. – taga Nov 26 '19 at 09:29

If your question is 'What can I do to avoid getting banned?', then the answer is 'Don't violate the TOS', which means no scraping at all and using the proper search API instead. Google does tolerate a certain number of searches per IP address, so if you are only scraping a handful of searches, this should be no problem.

If your question is 'How do I use a proxy with the requests module?', then you should start with the proxies section of the requests documentation.

import requests

# Replace these placeholder addresses with proxies you actually run or rent
proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)

But this is only the Python side: you need to set up a web proxy (or, even better, a pool of proxies) yourself and then use an algorithm to choose a different proxy every N requests, for example.
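As a sketch of that last step, assuming you already have a pool of proxy URLs (the addresses below are placeholders, like the ones above), you can cycle through them with itertools and pass a different one to each request:

```python
import itertools

# Placeholder pool: replace with proxies you actually run or rent.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Advance the rotation and return a proxies dict for requests."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# In the scraper's loop:
# response = requests.get(url, headers=headers, proxies=next_proxies())
```

itertools.cycle simply repeats the pool forever; a fancier algorithm could instead pick proxies at random or skip ones that recently failed.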

Frieder
  • 1,208
  • 16
  • 25
  • So I can't just paste this in my code? I have read the link that you posted; I read it a couple of days ago. But when I paste the code that you posted, I get an error. – taga Nov 26 '19 at 09:32
  • The proxy server is not related to Python (although you can use Python to implement one). If you don't want to set up a proxy server of your own, there are some public HTTP proxies available (google them), but keep in mind that there are privacy issues when using HTTP, and public servers are slow/unreliable most of the time. – Frieder Nov 26 '19 at 09:35
  • I have tried this but I always get this error, and I have also tried with different proxies: File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/urllib3/poolmanager.py", line 420, in __init__ raise ProxySchemeUnknown(proxy.scheme) urllib3.exceptions.ProxySchemeUnknown: Not supported proxy scheme None – taga Jan 23 '20 at 09:35

One more simple trick is to use Google Colab inside the Brave browser's Tor window; each session will come from a different IP address.

Once you have the data you want, you can then use it in your Jupyter notebook, VS Code, or elsewhere.

Using free proxies will usually get you an error, because too many requests are already hitting them; you would have to pick a lower-traffic proxy every time, which is a terrible task when choosing among hundreds.

(screenshots: error when using free proxies; correct results through Brave's Tor window)

Mayur Gupta