This question is similar to this one. I have read the answers, but none of them worked for me. I am trying to get the information from the bluish box on this site.

This is what I wrote:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'

req = requests.get(url)
soup = BeautifulSoup(req.text,'html5lib')
soup = soup.find('div', class_='game-header-body')

print(soup.prettify())

I get the error AttributeError: 'NoneType' object has no attribute 'prettify'. The reason is that find() cannot locate a div with the class 'game-header-body', so it returns None. When I remove the soup = soup.find('div', class_='game-header-body') line, I can see all of the HTML except the div I am interested in.

I have read that it may be better to switch to the 'html5lib' parser library. I installed it with pip3 install html5lib (I am using Python 3.4.3), but I still get the same error. What should I do?

Martin Evans
Pigna
    Hi, the element **game-header-body** is not present in the page source; it is loaded by JavaScript. You will need Selenium, which runs the JavaScript so that you can then extract the element. – Stack Jun 15 '17 at 10:16

1 Answer

The element game-header-body is not present in the HTML source; it is rendered later by JavaScript. You need something like Selenium to help with this. It can drive the browser of your choice (including a headless one if needed), which then runs the JavaScript for you. Once the page has fully loaded, you can take the resulting HTML and parse it with BeautifulSoup.
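One way to confirm that an element is missing from the static HTML (rather than being missed by the parser) is to scan the raw source for its class name. The following is a minimal sketch using only the standard library; `ClassFinder`, `has_class`, and the two sample HTML strings are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def has_class(html, class_name):
    """Return True if `class_name` appears on any tag in `html`."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical samples: what the server sends vs. what the browser ends up with.
static_html = '<div class="wrapper"><script>/* builds the header later */</script></div>'
rendered_html = '<div class="wrapper"><div class="game-header-body">...</div></div>'

print(has_class(static_html, "game-header-body"))    # False
print(has_class(rendered_html, "game-header-body"))  # True
```

Passing `req.text` from the question to `has_class` would show whether the div is in the downloaded source at all. If it is not, switching parsers cannot help, and a browser has to run the page's scripts first.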

The following example shows how this could be done with Selenium and an already installed Firefox browser:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'

browser = webdriver.Firefox(firefox_binary=FirefoxBinary())
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

for div in soup.find_all('div', class_='game-header-body'):
    print(div.prettify())
    print("----------------")

Note that there are multiple game-header-body divs, so this displays all of them.
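Because the div is added by JavaScript, reading page_source immediately after get() can occasionally be too early. A hedged variant of the same idea, assuming Selenium 4 or later with a working geckodriver, runs Firefox headless and waits explicitly for the element. The helper name `fetch_rendered_html` and the 10 second default timeout are arbitrary choices for this sketch, not part of the answer above.

```python
def fetch_rendered_html(url, css_selector, timeout=10.0):
    """Return the page HTML once an element matching `css_selector` exists."""
    # Imported here so this module still loads on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window

    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Block until at least one match is in the DOM; raises
        # TimeoutException if nothing appears within `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even after a timeout
```

Calling `fetch_rendered_html(url, 'div.game-header-body')` and passing the result to `BeautifulSoup(html, "html.parser")` should then allow the same `find_all` loop as above.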

Martin Evans
  • Thank you very much, it worked! Just one thing: could you please explain this line to me: `from selenium.webdriver.firefox.firefox_binary import FirefoxBinary`? I tried deleting it, along with the `webdriver.Firefox()` parameter, and it seems to work the same. Is it necessary? If so, why? – Pigna Jun 15 '17 at 10:49
  • Different versions of Selenium have needed different settings; this is just one I have used that I know still works with my version. If the other form works for you, that's good too. – Martin Evans Jun 15 '17 at 10:51