
I'm trying to scrape data from a website using BeautifulSoup in Python. When I parse the page, the information I want to scrape doesn't show up; instead I see this:

<span class="frwp-debug hidden" style="display: none!important; visibility: hidden!important;">  

The parsed html is different from what I see when I inspect the page.

This is my code:

import requests
from bs4 import BeautifulSoup

site = "http://www.fifa.com/worldcup/stories/y=2017/m=11/news=australia-2921204.html#World_Cup_History"
hdr = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(site, headers=hdr)  # send the custom User-Agent with the request
soup = BeautifulSoup(page.text, "html.parser")
print(soup.prettify())

How do I scrape the hidden information?

hx chua
  • `BeautifulSoup` parses the HTML properly - it's just that the page loads all its contents over Ajax and BS doesn't handle that. From the first look, i think you need to parse the value from `stScript.setAttribute('data-storyid', ...);` and build the proper URL to get that JSON - Or get started with selenium. – wiesion Jun 03 '18 at 16:45
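
A rough sketch of the first step suggested in the comment above: pulling the `data-storyid` value out of the raw page source with a regular expression. The function name `extract_story_id`, the assumption that the value is a quoted string literal, and the sample input are all made up for illustration. The URL of the JSON endpoint isn't given here, so building that request is left out.

```python
import re

# Matches the call named in the comment: stScript.setAttribute('data-storyid', ...);
# Assumption: the second argument is a quoted string literal.
STORY_ID_RE = re.compile(
    r"""stScript\.setAttribute\(\s*['"]data-storyid['"]\s*,\s*['"]([^'"]+)['"]\s*\)"""
)


def extract_story_id(html):
    """Return the data-storyid value found in the page source, or None."""
    match = STORY_ID_RE.search(html)
    return match.group(1) if match else None


# Made-up sample input for illustration only; not copied from the real page.
sample = "<script>stScript.setAttribute('data-storyid', '12345');</script>"
print(extract_story_id(sample))
```

On the real page you would pass `page.text` from `requests.get(...)` instead of `sample`, and check for `None` in case the script is written differently.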

1 Answer


The problem is that the content you want is created by JavaScript after the page loads. The requests library only downloads the initial HTML and doesn't run scripts, so BeautifulSoup never sees that content. Fortunately, you can use the Selenium library together with PhantomJS to get the fully rendered page, and then use BeautifulSoup to parse the resulting (finished) HTML.

Here's how that would work in your case:

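from bs4 import BeautifulSoup
from selenium import webdriver

site = "http://www.fifa.com/worldcup/stories/y=2017/m=11/news=australia-2921204.html#World_Cup_History"
# Note: Selenium has deprecated PhantomJS support in favor of headless browsers
browser = webdriver.PhantomJS()
browser.get(site)               # load the page and let its scripts run
html = browser.page_source      # HTML as rendered by the browser
browser.quit()                  # shut down the browser process
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())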

That should solve your problem.

Note that you'll have to install a couple of things: selenium (`pip install selenium`) and the PhantomJS webdriver, downloadable from http://phantomjs.org/download.html. Depending on how you install it, you may have to add it to your system path. I used this SO answer for that.

jchung
  • This is strange. I still get the same parsed html that hides the information for the span tag. I ran into a warning though - not sure if this should be a concern: warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless... – hx chua Jun 04 '18 at 05:54
  • PhantomJS works for me with Selenium v3.12.0 on Python 2.7. If you want to use the headless chrome browser, that's also fine. https://intoli.com/blog/running-selenium-with-headless-chrome/ – jchung Jun 04 '18 at 06:02
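
For the headless Chrome route mentioned in the last comment, here is a minimal sketch, assuming a recent Selenium release and a local Chrome install. The function name `fetch_rendered_html` and the `wait_css` parameter are hypothetical. The explicit wait is only a guess at why the rendered HTML might still look unchanged (the scripts may not have finished yet), not a confirmed fix for this page.

```python
def fetch_rendered_html(url, wait_css=None, timeout=10):
    """Load url in headless Chrome and return the rendered page source.

    wait_css is an optional (hypothetical) CSS selector to wait for
    before the page source is read.
    """
    # Imported inside the function so it can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        if wait_css:
            # Block until an element matching the selector exists, or time out.
            WebDriverWait(browser, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
            )
        return browser.page_source
    finally:
        browser.quit()  # always shut the browser down, even on errors
```

You would then call `html = fetch_rendered_html(site)` and pass the result to `BeautifulSoup(html, 'html.parser')` as before. If you use `wait_css`, pick a selector for an element that only appears once the scripts have finished.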