I want to scrape the news headlines from this page: https://www.forexfactory.com/news, while scrolling down and clicking the "More" button.

I tried requests and bs4, but they didn't return the data:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/news'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     # print(r.status_code)

soup = BeautifulSoup(r.content, 'html.parser')

soup.select('.flexposts__item.flexposts__story') # returns []

print(r.status_code) # returns 503

I checked the Network tab in the browser's developer tools and found other URLs that return the raw response data.

I tried those URLs with requests as well, but got the same 503 response:

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/flex.php?more=2'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     

soup = BeautifulSoup(r.content, 'html.parser')

print(r.status_code) # returns 503

print(r.text) # returns HTML, but without the headline content

soup.select('.flexposts__item.flexposts__story') # returns []

I also tried Selenium, but it was the same; it didn't return the headline elements either:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          

driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)
u = 'https://www.forexfactory.com/news'
driver.get(u)
driver.implicitly_wait(60)
driver.find_elements(By.CSS_SELECTOR, '.flexposts__item.flexposts__story') # return []

soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select('.flexposts__item.flexposts__story') # return []

driver.quit()
khaled koubaa

3 Answers


The problem is that you are being blocked by Cloudflare DDoS protection:

A Distributed Denial of Service attack (DDoS) seeks to make an online service unavailable to its end users. For all plan types, Cloudflare provides unmetered mitigation of DDoS attacks at Layer 3, 4, and 7.

If you print the output of soup.prettify(), you will see:

...
 <div class="attribution">
       DDoS protection by
       <a href="https://www.cloudflare.com/5xx-error-landing/" rel="noopener noreferrer" target="_blank">
        Cloudflare
       </a>
...
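
A quick way to confirm this from the requests response itself is to check for that attribution block. This is only a minimal sketch, assuming the challenge page always comes back with a 503 status and the Cloudflare attribution div shown above:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

r = requests.get('https://www.forexfactory.com/news', timeout=30, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

# A 503 status together with the Cloudflare attribution div means you got
# the challenge page, not the actual news listing.
blocked = r.status_code == 503 and soup.select_one('div.attribution') is not None
print('Blocked by Cloudflare:', blocked)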

So, to avoid getting blocked with Selenium, you have two options:

  1. Don't run Chrome in --headless mode
  2. Add a user-agent header when running in --headless mode, as in the code below

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36")

driver = webdriver.Chrome(options=options)
u = 'https://www.forexfactory.com/news'
driver.get(u)
driver.implicitly_wait(60)
driver.find_elements(By.CSS_SELECTOR, '.flexposts__item.flexposts__story')

soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.select('.flexposts__item.flexposts__story'))

driver.quit()
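
Once that select() actually returns elements, getting the headline text out is straightforward. A small sketch, continuing from the soup object built above and assuming (as in the last answer below) that each story keeps its title in a .flexposts__story-title node:

# Continuing from the `soup` built above; the .flexposts__story-title
# selector is borrowed from the other answer and may need adjusting.
for story in soup.select('.flexposts__item.flexposts__story'):
    title = story.select_one('.flexposts__story-title')
    if title:
        print(title.get_text(strip=True))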

Using the requests library alone is not possible here, since it cannot get past Cloudflare's browser check.


MendelG
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/news'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     # print(r.status_code)

soup = BeautifulSoup(r.content, 'html.parser')

soup.select('.flexposts__item.flexposts__story') # returns []

print(r.status_code) # returns 200
DSteman

I was able to scrape the headlines with Selenium.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# enter the location of the chromedriver executable file
chromePath = "chromedriver.exe"
driver = webdriver.Chrome(service=Service(chromePath))

driver.maximize_window()
url = 'https://www.forexfactory.com/news'
driver.get(url)

# collect every story title currently rendered on the page
headlines = driver.find_elements(By.CLASS_NAME, 'flexposts__story-title')

for headline in headlines:
    print(headline.text)
    print('')

driver.quit()

Edit: For some reason, clicking the "More" button doesn't work in the driver window.
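
One thing worth trying is to scroll the button into view and click it through JavaScript instead of a normal click. The sketch below does that; note that the PARTIAL_LINK_TEXT locator for the "More" link is an assumption about the current page markup, so check it against what the DevTools inspector shows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.forexfactory.com/news')
wait = WebDriverWait(driver, 30)

# Locator is a guess at how the "More" link is labelled on the page.
more = wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'More')))

count_before = len(driver.find_elements(By.CLASS_NAME, 'flexposts__story-title'))

# Scroll the button into view, then click it via JavaScript, which often
# works when a normal .click() is intercepted by an overlapping element.
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", more)
driver.execute_script("arguments[0].click();", more)

# Wait until extra stories have been appended, then print all of them.
wait.until(lambda d: len(d.find_elements(By.CLASS_NAME, 'flexposts__story-title')) > count_before)
for headline in driver.find_elements(By.CLASS_NAME, 'flexposts__story-title'):
    print(headline.text)

driver.quit()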

Dharman