How to scrape a JavaScript website in Python?

Question

I am trying to scrape a website. I have tried two methods, but neither gives me the full page source I am looking for. I am trying to scrape the news titles from the website URL provided below.

URL: "https://www.todayonline.com/"

These are the two methods I tried, both of which failed.

Method 1: Beautiful Soup

import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
soup  # Returns HTML that consists mostly of JavaScript
soup.find_all('h3')

### Returns an empty list []

Method 2: Selenium + BeautifulSoup

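import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

# chromedriver is looked up on PATH
driver = webdriver.Chrome(options=options)

driver.get(tdy_url)
time.sleep(10)  # wait for the JavaScript to load content
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
soup.find_all('h3')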

### Returns fewer than 1/4 of the 'h3' tags found in the original page source

Please help. I have tried scraping other news websites and it is so much easier. Thank you.

The news data on the website you are trying to scrape is fetched with JavaScript, and is not returned by the server. But in the first example you are getting just the page returned by the server -- neither requests nor BeautifulSoup execute JS. However, you can open the Firefox (Chromium) DevTools and take a look at which requests get the data from the server, and try to imitate them with requests then. It might be even easier than trying to do webscraping with BeautifulSoup. — Demian Wolf, Sep 06 '20 at 08:37
See @politicalscientist's answer as well. He does exactly what I described in the first comment. — Demian Wolf, Sep 06 '20 at 08:40

score 4 · Answer 1 · edited Jun 05 '21 at 10:29

The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR -- XMLHttpRequest). It happens dynamically, while the page is loading or being scrolled, so this data is not included in the page returned by the server.

In the first example, you are getting only the page returned by the server -- without the news, but with JS that is supposed to get them. Neither requests nor BeautifulSoup can execute JS.

However, you can try to reproduce the requests that fetch the news titles from the server using the Python requests library. Follow these steps:

1. Open your browser's DevTools (usually F12 or Ctrl+Shift+I), switch to the Network tab, and look for the requests that fetch the news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup.
2. Copy the request link (right-click -> Copy -> Copy link) and pass it to requests.get(...).
3. Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I recommend using pprint instead of plain print. Note that you have to run from pprint import pprint before using it.
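
As a small illustration of the last step, the sketch below pretty-prints a hand-made dict shaped like the feed response used in this answer (a "nodes" list whose items wrap a "node" dict with a "title"). The sample values and the describe_keys helper are made up for illustration; the real response contains more fields.

```python
from pprint import pprint

# Hand-made stand-in for the parsed JSON; the real payload has more fields.
sample = {
    "nodes": [
        {"node": {"title": "First headline", "extra": {"a": 1}}},
        {"node": {"title": "Second headline", "extra": {"a": 2}}},
    ]
}


def describe_keys(data):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in data.items()}


# depth limits how many nesting levels are printed; deeper levels show as "...".
pprint(sample, depth=3)
print(describe_keys(sample))
```

Lowering depth is handy when the real response is large and you only want to see its outline first.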

Here is an example of the code that gets the titles from the main news on the page:

import requests


nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
        .json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
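
A slightly more defensive variant of the same idea is sketched below. It assumes the response keeps the nodes -> node -> title layout used above; the function names, the 10-second timeout, and the choice to skip entries without a title are illustrative choices, not part of the original code.

```python
FEED_URL = "https://www.todayonline.com/api/v3/news_feed/7"


def extract_titles(payload):
    """Return the titles found under payload["nodes"][i]["node"]["title"]."""
    titles = []
    for entry in payload.get("nodes", []):
        title = entry.get("node", {}).get("title")
        if title:  # skip entries that carry no title
            titles.append(title)
    return titles


def fetch_titles(url=FEED_URL, timeout=10):
    """Download the feed and return its titles."""
    import requests  # imported lazily so extract_titles works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_titles(response.json())
```

Calling print(fetch_titles()) would then print the list of titles, provided the endpoint still responds in this shape.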

If you want to scrape a group of news under a particular caption, change the number after news_feed/ in the request URL (to find it, filter the requests by "news_feed" in DevTools and scroll down the news page).
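
Changing the number can be wrapped in a small helper, as sketched here. Only feed 7 appears in this answer; any other ID you pass is something you would first have to find in DevTools, and the helper names are hypothetical.

```python
BASE_URL = "https://www.todayonline.com/api/v3/news_feed/"


def feed_url(feed_id):
    """Build the request URL for one news feed."""
    return f"{BASE_URL}{feed_id}"


def fetch_feeds(feed_ids, timeout=10):
    """Return a dict mapping each feed ID to its parsed JSON response."""
    import requests  # lazy import keeps feed_url usable on its own

    results = {}
    with requests.Session() as session:  # reuse one connection for all feeds
        for feed_id in feed_ids:
            response = session.get(feed_url(feed_id), timeout=timeout)
            response.raise_for_status()
            results[feed_id] = response.json()
    return results
```

For example, fetch_feeds([7]) would return a dict whose single key, 7, holds the same data as the request above.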

Sometimes web sites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to do these steps as well.
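
The steps referred to above are not reproduced here. As one commonly used measure, and purely as an assumption about what might help, a request can carry browser-like headers. The header values and function names below are illustrative placeholders; copy the real values from the request shown in your DevTools.

```python
def browser_like_headers(referer):
    """Return headers resembling those a browser sends (placeholder values)."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) "
            "Gecko/20100101 Firefox/100.0"
        ),
        "Accept": "application/json, text/plain, */*",
        "Referer": referer,
    }


def get_json(url, referer, timeout=10):
    """Request url with browser-like headers and return the parsed JSON."""
    import requests  # lazy import so the header helper needs no dependency

    response = requests.get(
        url, headers=browser_like_headers(referer), timeout=timeout
    )
    response.raise_for_status()
    return response.json()
```

Whether this is enough depends entirely on the site; some protections cannot be passed with headers alone.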

score 3 · Accepted Answer · answered Sep 06 '20 at 08:38


You can access the data via the API (check the Network tab in your browser's DevTools). For example:

import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

help-ukraine-now

score 2 · Answer 3 · edited Feb 23 '21 at 22:08

I suggest a fairly simple approach:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]

print(news)

Output:

['DBS named world’s best bank by New York-based financial publication',
 'Russia has very serious questions to answer on Navalny - UK',
 "Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
 'Three militants killed after fatal attack on policeman in Tunisia',
.....]

You can also check the XML page itself for more information if required.

P.S. Always check that scraping is permitted before scraping any website :)
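
If you would rather not rely on an HTML parser for an XML document, the same titles can be read with the standard library. This is a sketch that assumes the file follows the Google News sitemap format, where titles live in news:title elements under the namespace http://www.google.com/schemas/sitemap-news/0.9; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Google News sitemaps for <news:title> and related tags.
NEWS_NS = {"news": "http://www.google.com/schemas/sitemap-news/0.9"}


def titles_from_sitemap(xml_text):
    """Return the text of every news:title element in a news sitemap."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iterfind(".//news:title", NEWS_NS)
        if element.text
    ]
```

You would pass it the downloaded document, for example the page variable from the code above.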

score 0 · Answer 4 · answered Sep 06 '20 at 08:39

There are different ways of gathering the content of a webpage that relies on JavaScript:

Using Selenium with the Firefox web driver
Using a headless browser such as PhantomJS
Making an API call using a REST client or the Python requests library

You have to do your research first to see which one fits the site.
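
For the Selenium route, content that loads only while the page is being scrolled will be missing from page_source unless the page is scrolled first. Here is a minimal sketch, assuming driver is a Selenium WebDriver that has already loaded the page; the function name, pause length, and round limit are arbitrary choices.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=10):
    """Scroll until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            return round_number
        last_height = new_height
    return max_rounds
```

After scroll_to_bottom(driver) returns, read driver.page_source and parse it as usual.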
