
The findAll function in BeautifulSoup is returning an empty list. I know this happens when no matching content can be found, but there is content on the page that fits the criteria I am searching by, so I'm not sure what is going wrong. Here is the code:

# Import libraries
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value') #[i]
print(links)
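To rule out the selector itself, here is a minimal, self-contained sketch (static HTML strings stand in for the live page, and the snippets are hypothetical): the same find_all call returns matches when the spans are present in the HTML that requests actually received, and an empty list when they are not.

```python
from bs4 import BeautifulSoup

# HTML as a browser sees it after JavaScript has run (hypothetical snippet)
rendered = '<span class="odometer-value">4</span><span class="odometer-value">2</span>'
# HTML as requests sees it: just a loading placeholder, no odometer spans
raw = '<div class="loading">Loading...</div>'

print(BeautifulSoup(rendered, "html.parser").find_all('span', class_='odometer-value'))
# two matching spans
print(BeautifulSoup(raw, "html.parser").find_all('span', class_='odometer-value'))
# []
```

So an empty list usually means the element was never in `response.text` to begin with, not that the selector is wrong.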

And here is a picture of the information I am trying to scrape: [HTML code image] (the line at the bottom is the one I'm looking to scrape specifically, if that helps at all).

  • well it is not the soup in the first place – Benoit Drogou Aug 03 '21 at 13:09
  • Scrapers are not always welcome... have you read the robots.txt policy of the host? Try passing a user agent with your request; here you can see how to do that: https://stackoverflow.com/questions/68633248/cant-parse-coin-gecko-page-from-today-with-beautifulsoup-because-of-cloudflare/68634188#68634188 . You can find user agent strings around, for example 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/88.0' or such – cards Aug 03 '21 at 13:14
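As the second comment suggests, a User-Agent is passed via the `headers` argument of `requests.get`. A minimal sketch (the header string is just an example; this alone will not help if the page is rendered by JavaScript, as the answer below explains):

```python
import requests

url = 'https://tokcount.com/?user=mrsam993'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'}

try:
    # The custom header is sent with this request instead of requests' default
    response = requests.get(url, headers=headers, timeout=10)
    print(response.status_code)
except requests.RequestException as exc:
    # Network errors (no connection, blocked by the host, timeout) land here
    print(f"request failed: {exc}")
```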

1 Answer

The reason BS4 says the element does not exist is that it is rendered by JavaScript; requests does not make XHR requests for you or emulate a real browser with JS support. When you first open the page, it shows you a loading screen.

You should use Selenium with headless Chrome/Firefox to scrape JS pages with Python. If you want to use Selenium, you can do something like this (as an example; you might need to use WebDriverWait to wait for the element to render):

from selenium import webdriver
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'

# Define options
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Connect to the URL
browser = webdriver.Chrome(options = options)
browser.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value') #[i]
print(links)

If you insist on using requests, open your browser's developer tools, go to the Network tab, inspect the XHR requests the page makes, and replicate them yourself with requests. If you go with this approach in Firefox, the built-in Network Monitor (the successor to Firebug) helps with this. Here's what it looks like for your website: [Image]

Another thing worth mentioning is requests-html; read its docs. An example using requests-html:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'

# Connect to the URL
session = HTMLSession()
r = session.get(url)

# Render the JavaScript, then parse the HTML and save to BeautifulSoup object
r.html.render()  # downloads Chromium on first use
soup = BeautifulSoup(r.html.html, "html.parser")

# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value') #[i]
print(links)

Please refer to this: Web-scraping JavaScript page with Python

And this too: Scrape javascript-rendered content with Python

Datajack
  • If my answer solved your problem or any of the code samples above worked for you, or if it helped you significantly, please click the green tick mark next to my answer. – Datajack Aug 03 '21 at 13:38
  • Hi, thank you so much for the answer. Just a few extra questions: first, do I need to run headless Chrome from inside the program, or is that a setting I need to change in my browser? Secondly, when testing the first code snippet I get "'str' object is not callable" and an error pops up on the line that contains "html.parser"; switching it to 'html.parser' also does not fix this. If you could point me in the right direction that would be fantastic, thank you – pythonsnake42 Aug 03 '21 at 15:57
  • @pythonsnake42, to make chrome headless, you just have to make a small change in your code. When you're defining browser as webdriver.Chrome, write `options = webdriver.ChromeOptions()`, then `options.add_argument("--headless")` and *then* define browser with `browser = webdriver.Chrome(options = options)` – Datajack Aug 04 '21 at 05:25
  • @pythonsnake42 The error you are getting is because I made a small mistake. I fixed that in the answer and also added the headless argument. Now the snippet of code should run properly. – Datajack Aug 04 '21 at 05:34