
I am trying to get data from Reuters with the code below, but I think I got blocked from scraping more data because of the continuous requests. Is there a way to resolve this? I am using Google Colab. Although there are a lot of similar questions, they are all unanswered, so I would really appreciate some help with this. Thanks!

!pip install selenium
!apt-get update 
!apt install chromium-chromedriver

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")
# Collect links to news items until one is older than a year
links = []
i = 0
try:
    while True:
        news_items = driver.find_elements(By.XPATH, "//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", news_items[i])
        if news_items[i].find_element(By.TAG_NAME, "time").get_attribute("innerText") == "a year ago":
            break
        links.append(news_items[i].find_element(By.TAG_NAME, "a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except IndexError:
    # stop once we run out of loaded items
    pass

# Visit each link and collect the article text (one entry per link)
news = []
for link in links:
    driver.get(link)
    paragraphs = driver.find_elements(By.XPATH, "//div[contains(@class,'Article__container')]/div/div/div[2]/p")
    news.append(" ".join(p.get_attribute("innerText") for p in paragraphs))

driver.quit()

import pandas as pd

df = pd.DataFrame({'x':links, 'y':news})
df

Full error stacktrace: (posted as a screenshot)

huy

1 Answer


Here's a generic answer.

The following is a list of things to keep in mind when scraping a website to prevent detection:

1) Adding a User-Agent header - Many websites do not allow access if valid headers are not passed, and the user-agent header is a very important one.

Example:- chrome_options.add_argument("user-agent=Mozilla/5.0")

2) Setting window-size when going headless - Websites are often able to detect when a headless browser is being used; a common workaround is to add the window-size argument to your script.

Example:- chrome_options.add_argument("--window-size=1920,1080")

3) Mimicking human behavior- Avoid clicking or navigating through the website at very fast rates. Use timely waits to make your behavior more human-like.
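Example (a minimal sketch based on the scroll loop from the question, with driver and By assumed from that setup; the 2-second pause is an arbitrary value):

import time

for item in driver.find_elements(By.XPATH, "//div[@class='item']"):
    driver.execute_script("arguments[0].scrollIntoView(true);", item)
    time.sleep(2)  # pause after each scroll instead of hammering the page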

4) Using random waits - This is a continuation of the previous point: people often keep constant delays between actions, but even that can lead to detection. Randomize the delays as well.
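Example (a sketch of the same idea with randomized delays; human_pause is just an illustrative helper name, the 1.5-4 second range is arbitrary, and links and driver come from the question's setup):

import random
import time

def human_pause(low=1.5, high=4.0):
    # sleep for a random interval so the delay between actions is never constant
    time.sleep(random.uniform(low, high))

for link in links:
    driver.get(link)
    human_pause()  # randomized delay before the next request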

5) User-Agent rotation - Try changing your user agent from time to time when scraping a website.
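Example (a rough sketch; the user-agent strings below are only illustrative, and because ChromeOptions are fixed for a session, you pick a new one each time you create a driver):

import random

from selenium import webdriver

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36",
]

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")  # pick a different agent per session
driver = webdriver.Chrome(options=chrome_options)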

6) IP-rotation (using proxies) - Some websites ban individual IPs, or even complete geographical areas, from accessing their sites if they detect a scraper. Rotating your IP might trick the server into believing that the requests are coming from different devices. IP-rotation combined with User-Agent rotation can be very effective.

Note: Please don't use freely available proxies; they have a very low success rate and hardly work. Use a premium proxy service.
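Example (a sketch; the address below is a placeholder for your proxy provider's endpoint, and note that Chrome's --proxy-server flag does not accept embedded username/password credentials, so authenticated proxies need extra handling):

PROXY = "http://203.0.113.5:8000"  # placeholder address, substitute a real proxy endpoint
chrome_options.add_argument(f"--proxy-server={PROXY}")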

7) Using external libraries - There are a lot of cases where all the above methods might not work because the website has a very good bot-detection mechanism. In that case, try the undetected_chromedriver library. It has come in handy a few times.
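Example (a minimal sketch, assuming the package is installed with pip install undetected-chromedriver; its options API mirrors regular Selenium):

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--no-sandbox")
driver = uc.Chrome(options=options)  # patches chromedriver to avoid common bot-detection checks
driver.get("https://www.reuters.com/companies/AAPL.O")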

Kamalesh S