
I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. It works well for normal websites, but when I tried to get data out of websites where the table data loads after some delay, I found that I get an empty table. An example would be this webpage

The script I've tried is a fairly routine one.

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2")
soup = BeautifulSoup(response.text, "html.parser")

content = soup.find('div', {'id': 'odds-data-portal'})

The data loads in the table odds-data-portal on the page, but the code doesn't give me that. How can I make sure the table has loaded with data before I grab it?

sfactor
    The table (content) is probably generated by JavaScript and thus can't be "seen" when you just HTTP GET. – jayme Mar 17 '16 at 08:37

2 Answers


Sorry, I can't open the link. But the table is probably generated in one of two ways:

  1. Purely by JavaScript with no AJAX call.
  2. Using an AJAX call and some JavaScript for DOM manipulation.

If it is the first case, then you have no option but to use selenium-webdriver in Python. Also, you can have a look at the example in this answer.

If it is the second case, then you can find out the URL and the data sent, and then, using the requests module, send a similar request to fetch the data. The data can be in JSON format or HTML (depends on how good the developer is); you'll have to parse it accordingly.

Sometimes, the AJAX call may require, as data, a CSRF token or a cookie; in that case you'll have to fall back to the solution for the first case.
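For the second case, a minimal sketch of replaying the XHR with requests could look like the following. The endpoint URL, the headers, and the JSON shape here are all invented for illustration — the real ones have to be read off the browser's network tab:

```python
import json

import requests


def fetch_odds(xhr_url, referer):
    """Replay the XHR the page makes. The headers below are typical for
    such endpoints, but the real set must be copied from DevTools."""
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    response = requests.get(xhr_url, headers=headers)
    response.raise_for_status()
    return response.json()


# Parsing the payload works the same whether it came over the wire or not;
# this sample JSON is invented purely to show the idea.
sample = '{"rows": [{"market": "Over/Under +2.5", "odds": [1.85, 1.95]}]}'
for row in json.loads(sample)["rows"]:
    print(row["market"], row["odds"])
```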

JRodDynamite

You will need to use something like selenium to get the html, but you can continue to use BeautifulSoup to parse it, as follows:

from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver

url = "http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2"
browser = webdriver.Firefox()

browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
data_table = soup.find('div', {'id': 'odds-data-table'})

for div in data_table.find_all_next('div', class_='table-container'):
    row = div.find_all(['span', 'strong'])

    if len(row):
        print(','.join(cell.get_text(strip=True) for cell in itemgetter(0, 4, 3, 2, 1)(row)))

This would display:

Over/Under +0.5,(8),1.04,11.91,95.5%
Over/Under +0.75,(1),1.04,10.00,94.2%
Over/Under +1,(1),1.04,11.00,95.0%
Over/Under +1.25,(2),1.13,5.88,94.8%
Over/Under +1.5,(9),1.21,4.31,94.7%
Over/Under +1.75,(2),1.25,3.93,94.8%
Over/Under +2,(2),1.31,3.58,95.9%
Over/Under +2.25,(4),1.52,2.59,95.7%   
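The row-extraction logic can be tried offline against a simplified fragment of markup. The HTML below is invented to mirror the structure the loop expects (market name, payout, two odds, and the bookmaker count in a strong tag), not copied from the real page:

```python
from operator import itemgetter

from bs4 import BeautifulSoup

# Invented fragment mirroring the structure the scraping loop expects.
html = """
<div id="odds-data-table">
  <div class="table-container">
    <span>Over/Under +2.5</span><span>94.9%</span>
    <span>1.95</span><span>1.85</span><strong>(5)</strong>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
data_table = soup.find('div', {'id': 'odds-data-table'})
for div in data_table.find_all_next('div', class_='table-container'):
    row = div.find_all(['span', 'strong'])
    if len(row):
        print(','.join(cell.get_text(strip=True)
                       for cell in itemgetter(0, 4, 3, 2, 1)(row)))
# -> Over/Under +2.5,(5),1.85,1.95,94.9%
```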

Update - as suggested by @JRodDynamite, PhantomJS can be used instead of Firefox to run headless. To do this:

  1. Download the PhantomJS Windows binary.

  2. Extract the phantomjs.exe executable and ensure it is in your PATH.

  3. Change the browser line to: browser = webdriver.PhantomJS()

Martin Evans
  • Thanks Martin ! This works nicely. One question about this. This seems to open the firefox browser, but this won't be available in a command line environment. How would one go about in those cases? – sfactor Mar 17 '16 at 09:57
  • 1
    It uses Firefox to do the processing and get the resulting html so you need it to run. There are tricks to make it hidden though. Try searching for `Selenium headless`. – Martin Evans Mar 17 '16 at 09:59
  • 3
    @sfactor - You can use a headless browser like [PhantomJS](http://phantomjs.org/). Have a look at this [answer](http://stackoverflow.com/a/15699761/2932244) – JRodDynamite Mar 17 '16 at 09:59