pandas.read_html returns only one table

Question

I am trying to read the EC2 pricing tables with pandas. Based on the documentation I expected a list of DataFrames, but I got a list containing only one table.

Code example

import pandas
link = 'http://aws.amazon.com/ec2/pricing/' 
data = pandas.read_html(link)
print type(data)
print data[0]

Output

<type 'list'>
                               0                 1                2
0  Reserved Instance Volume Discounts               NaN              NaN
1            Total Reserved Instances  Upfront Discount  Hourly Discount
2                  Less than $250,000                0%               0%
3              $250,000 to $2,000,000                5%               5%
4            $2,000,000 to $5,000,000               10%              10%
5                More than $5,000,000        Contact Us       Contact Us

Environment:

Ubuntu 14.10
python 2.7.8
pandas 0.14.1

What about `type(data[0])`? The docs you link say `read_html` will return a list of DataFrames and that's what you got: a list containing 1 DataFrame. By examining the source, it looks like the only true HTML table (using `tr`, `td`) at that URL is the Reserved Instance Volume Discounts. — wflynny, Nov 24 '14 at 20:55

unutbu · Accepted Answer · 2014-11-25T12:55:19.093

http://aws.amazon.com/ec2/pricing/ uses JavaScript to fill in the data in the tables.

Unlike what you see when you point your GUI browser at the link, the data is missing if you download the HTML using urllib2:

import urllib2
link = 'http://aws.amazon.com/ec2/pricing/'
response = urllib2.urlopen(link)
content = response.read()

(Then search the contents for <table> tags.)
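
A minimal sketch of that search, assuming a rough regex count is enough for a quick look (it is not a real HTML parser). The helper name `count_table_tags` and the two sample strings are made up for illustration; in practice you would pass in the `content` downloaded above.

```python
import re

def count_table_tags(html):
    """Count opening <table> tags in an HTML string (case-insensitive)."""
    # \b stops the pattern from counting longer names, e.g. a <tablet> tag.
    return len(re.findall(r'<table\b', html, flags=re.IGNORECASE))

# Hypothetical stand-ins: a page with a static table, and a page whose
# tables would only appear after its JavaScript has run.
static_page = '<html><body><table><tr><td>5%</td></tr></table></body></html>'
js_page = '<html><body><div id="pricing"></div><script>render()</script></body></html>'

print(count_table_tags(static_page))
print(count_table_tags(js_page))
```

If the count for the raw download is lower than the number of tables you see in the browser, the missing ones are most likely being built by JavaScript.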

To execute the JavaScript you'll need a browser automation tool such as Selenium, or an engine such as WebKit or SpiderMonkey.

Here is a solution using Selenium:

import contextlib
import pandas as pd
import selenium.webdriver as webdriver

@contextlib.contextmanager
def quitting(thing):
    # Shut the browser down even if the body of the with-block raises.
    try:
        yield thing
    finally:
        thing.quit()

with quitting(webdriver.Firefox()) as driver:
    link = 'http://aws.amazon.com/ec2/pricing/' 
    driver.get(link)
    content = driver.page_source
    with open('/tmp/out.html', 'wb') as f:
        f.write(content.encode('utf-8'))
    data = pd.read_html(content)
    print len(data)

yields

Thanks for pointing out the JavaScript; however, I have a problem with your code example. There is something wrong with a profile: " WebDriverException: Message: 'Can\'t load the profile. Profile Dir: /tmp/user/1001/tmpToFbg5" — Wawrzek, Nov 25 '14 at 09:52
It sounds like an [incompatibility between your versions](http://stackoverflow.com/q/20957968/190597) of Selenium and Firefox. — unutbu, Nov 25 '14 at 12:57
