I am very new to Beautiful Soup and I'm trying to import data from the URL below as a pandas DataFrame. However, the final result has the correct column names but no values in the rows. What should I be doing instead?

Here is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
Jojo
  • Can you provide the output you get when you run the current code, and share what your desired output should be? That will help us give you some tips. – Joe Ferndz Oct 04 '20 at 19:08

2 Answers

The data you see in the table is loaded from another URL via JavaScript. You can request that URL directly and save the data to CSV, as in this example:

import json
import requests 
import pandas as pd

data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')

Saves data.csv.

[Screenshot: data.csv opened in LibreOffice]
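
To make it clearer what pd.json_normalize does with a payload of this shape, here is a minimal offline sketch. The sample dictionary and all of its field names (expirationMonth, last, volume, lastTradeDate) are made up for illustration and are not taken from the real response; only the top-level 'quotes' key matches the answer above.

```python
import pandas as pd

# Hypothetical payload: these field names and values are invented for
# illustration and do not come from the real endpoint.
sample = {
    "quotes": [
        {"expirationMonth": "DEC 2020", "last": "99.750", "volume": "1,234",
         "lastTradeDate": {"default24": "16:00:00"}},
        {"expirationMonth": "MAR 2021", "last": "99.800", "volume": "567",
         "lastTradeDate": {"default24": "16:00:00"}},
    ]
}

# One row per element of the "quotes" list; nested dicts are flattened
# into dot-separated column names such as "lastTradeDate.default24".
df = pd.json_normalize(sample["quotes"])

# In this sample the numbers are stored as strings, so they are converted
# before any arithmetic. Thousands separators are stripped first.
df["last"] = pd.to_numeric(df["last"])
df["volume"] = pd.to_numeric(df["volume"].str.replace(",", "", regex=False))

print(df.columns.tolist())
print(df.dtypes)
```

If the real response also stores numbers as strings, a similar conversion step would be needed before writing the CSV or doing calculations; printing the JSON as shown in the commented-out line above is the way to check.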

Andrej Kesely
  • How were you able to find https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G? – Jojo Oct 04 '20 at 23:22
  • @Jojo I looked in the Firefox developer tools -> Network tab (Chrome has something similar too). It lists all the requests the page makes. One of these requests was for this JSON file. – Andrej Kesely Oct 05 '20 at 07:10

The website you're trying to scrape renders the table values dynamically, and requests.get only returns the HTML the server sends before any JavaScript runs. You will have to find another way to access the data or render the page's JavaScript (see this example).
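
One way to confirm this diagnosis is to count the table cells in the raw HTML. The sketch below uses only the standard library; the static_html string is a hypothetical stand-in for the server response, and TableCellCounter and count_cells are names introduced here for illustration.

```python
from html.parser import HTMLParser


class TableCellCounter(HTMLParser):
    """Counts <th> and <td> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.header_cells = 0
        self.data_cells = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "th":
            self.header_cells += 1
        elif tag == "td":
            self.data_cells += 1


def count_cells(html):
    """Return (number of header cells, number of data cells)."""
    counter = TableCellCounter()
    counter.feed(html)
    counter.close()
    return counter.header_cells, counter.data_cells


# Hypothetical stand-in for the server response: the headers are present,
# but the body is empty because JavaScript would fill it in later.
static_html = """
<table>
  <thead><tr><th>Month</th><th>Last</th><th>Volume</th></tr></thead>
  <tbody></tbody>
</table>
"""

headers, cells = count_cells(static_html)
print(headers, cells)  # 3 0
```

Passing requests.get(url).text to count_cells gives the same check for a real page: header cells but zero data cells would match the symptom in the question (column names, no rows).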

A common way to do this is to use Selenium to automate a browser, which renders the JavaScript so you can grab the resulting page source.

Here is a quick example:

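import time
from io import StringIO

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

# Request the dynamically loaded page source
# (Selenium 4 takes the driver path via a Service object)
c = Chrome(service=Service(r'/path/to/webdriver.exe'))
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()    # Close the browser once the source has been captured

# Load into pd.DataFrame (recent pandas versions expect a file-like object)
tables = pd.read_html(StringIO(html_data))
df = tables[0]
df.columns = df.columns.droplevel()    # Convert the MultiIndex to an Index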

Note that I didn't use BeautifulSoup; pd.read_html can parse the page source itself. You'll have to do some more cleaning from there, but that's the gist.
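
A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. Selenium ships WebDriverWait (in selenium.webdriver.support.ui) for this purpose; the sketch below is a library-independent version of the same idea, polling a condition until it holds or a timeout expires. The wait_until helper and its parameters are names introduced here, not part of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Possible use with the driver from the example above (assumes that
# rendered rows show up as <td> elements in the page source):
# wait_until(lambda: '<td' in c.page_source, timeout=15)
# html_data = c.page_source
```

The predicate is always called at least once, so a page that is already rendered returns immediately instead of sleeping.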

Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or look for a way to access the data as JSON or CSV from elsewhere and use that.

Chris Greening