I want to scrape the news headlines from this page: https://www.forexfactory.com/news, while scrolling down and clicking the "More" button.

I tried requests and bs4, but they didn't return the data:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/news'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     # print(r.status_code)

soup = BeautifulSoup(r.content, 'html.parser')

soup.select('.flexposts__item.flexposts__story') # returns []

print(r.status_code) # returns 503

I checked the Network tab in the browser's developer tools and found other URLs that return the raw response data.

I tried those URLs with requests as well, but got the same 503 response:

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/flex.php?more=2'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     

soup = BeautifulSoup(r.content, 'html.parser')

print(r.status_code) # returns 503

print(r.text) # returns HTML, but without the headline content

soup.select('.flexposts__item.flexposts__story') # returns []

I also tried Selenium, but it was the same; it didn't return the headline elements either:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          

driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)
u = 'https://www.forexfactory.com/news'
driver.get(u)
driver.implicitly_wait(60)
driver.find_elements(By.CSS_SELECTOR, '.flexposts__item.flexposts__story') # return []

soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select('.flexposts__item.flexposts__story') # return []

driver.quit()
khaled koubaa

3 Answers


The problem is that you are being blocked by Cloudflare DDoS protection:

A Distributed Denial of Service attack (DDoS) seeks to make an online service unavailable to its end users. For all plan types, Cloudflare provides unmetered mitigation of DDoS attacks at Layer 3, 4, and 7.

If you print the output of soup.prettify(), you will see:

...
 <div class="attribution">
       DDoS protection by
       <a href="https://www.cloudflare.com/5xx-error-landing/" rel="noopener noreferrer" target="_blank">
        Cloudflare
       </a>
...
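
A quick way to confirm this from the requests response itself is to check for that attribution block. This is only a minimal sketch, assuming the challenge page always comes back with a 503 status and the Cloudflare attribution div shown above:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

r = requests.get('https://www.forexfactory.com/news', timeout=30, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

# A 503 status together with the Cloudflare attribution div means you got
# the challenge page, not the actual news listing.
blocked = r.status_code == 503 and soup.select_one('div.attribution') is not None
print('Blocked by Cloudflare:', blocked)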

So, to avoid getting blocked with Selenium, you have two options:

  1. Don't run Chrome in --headless mode
  2. Add a user-agent header when running in --headless mode, as in the code below

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36")

driver = webdriver.Chrome(options=options)
u = 'https://www.forexfactory.com/news'
driver.get(u)
driver.implicitly_wait(60)
driver.find_elements(By.CSS_SELECTOR, '.flexposts__item.flexposts__story')

soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.select('.flexposts__item.flexposts__story'))

driver.quit()
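
Once that select() actually returns elements, getting the headline text out is straightforward. A small sketch, continuing from the soup object built above and assuming (as in the last answer below) that each story keeps its title in a .flexposts__story-title node:

# Continuing from the `soup` built above; the .flexposts__story-title
# selector is borrowed from the other answer and may need adjusting.
for story in soup.select('.flexposts__item.flexposts__story'):
    title = story.select_one('.flexposts__story-title')
    if title:
        print(title.get_text(strip=True))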

Using the requests library alone is not possible here, since it cannot get past Cloudflare's browser check.


MendelG
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

u = 'https://www.forexfactory.com/news'

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)     # print(r.status_code)

soup = BeautifulSoup(r.content, 'html.parser')

soup.select('.flexposts__item.flexposts__story') # returns []

print(r.status_code) # returns 200
DSteman

I was able to scrape the headlines with Selenium.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# enter the location of the chromedriver executable file
chromePath = "chromedriver.exe"
driver = webdriver.Chrome(service=Service(chromePath))

driver.maximize_window()
url = 'https://www.forexfactory.com/news'
driver.get(url)

# collect every story title currently rendered on the page
headlines = driver.find_elements(By.CLASS_NAME, 'flexposts__story-title')

for headline in headlines:
    print(headline.text)
    print('')

driver.quit()

Edit: For some reason, clicking the "More" button doesn't work in the driver window.
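
One thing worth trying is to scroll the button into view and click it through JavaScript instead of a normal click. The sketch below does that; note that the PARTIAL_LINK_TEXT locator for the "More" link is an assumption about the current page markup, so check it against what the DevTools inspector shows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.forexfactory.com/news')
wait = WebDriverWait(driver, 30)

# Locator is a guess at how the "More" link is labelled on the page.
more = wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'More')))

count_before = len(driver.find_elements(By.CLASS_NAME, 'flexposts__story-title'))

# Scroll the button into view, then click it via JavaScript, which often
# works when a normal .click() is intercepted by an overlapping element.
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", more)
driver.execute_script("arguments[0].click();", more)

# Wait until extra stories have been appended, then print all of them.
wait.until(lambda d: len(d.find_elements(By.CLASS_NAME, 'flexposts__story-title')) > count_before)
for headline in driver.find_elements(By.CLASS_NAME, 'flexposts__story-title'):
    print(headline.text)

driver.quit()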

Dharman