I am trying to read the EC2 pricing tables with pandas. Based on the documentation I expected a list of DataFrames, one per table on the page, but I got a list containing only one table.

Code example

import pandas
link = 'http://aws.amazon.com/ec2/pricing/' 
data = pandas.read_html(link)
print type(data)
print data[0]

Output

<type 'list'>
                               0                 1                2
0  Reserved Instance Volume Discounts               NaN              NaN
1            Total Reserved Instances  Upfront Discount  Hourly Discount
2                  Less than $250,000                0%               0%
3              $250,000 to $2,000,000                5%               5%
4            $2,000,000 to $5,000,000               10%              10%
5                More than $5,000,000        Contact Us       Contact Us

Environment:

  • Ubuntu 14.10
  • python 2.7.8
  • pandas 0.14.1
Wawrzek
  • What about `type(data[0])`? The docs you link say `read_html` will return a list of DataFrames and that's what you got: a list containing 1 DataFrame. By examining the source, it looks like the only true HTML table (using `tr`, `td`) at that URL is the Reserved Instance Volume Discounts. – wflynny Nov 24 '14 at 20:55

1 Answer

http://aws.amazon.com/ec2/pricing/ uses JavaScript to fill in the data in the tables.

Unlike what you see when you point your GUI browser at the link, the data is missing if you download the HTML using urllib2:

import urllib2
link = 'http://aws.amazon.com/ec2/pricing/'
response = urllib2.urlopen(link)
content = response.read()  # raw HTML, before any JavaScript has run

(Then search the contents for <table> tags.)
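A minimal sketch of that search, using only the standard library. The `TableCounter` class, the `count_tables` helper and the `SAMPLE_HTML` string are made up for illustration; in practice you would pass in the `content` downloaded above. The import fallback is there so the snippet should also load on Python 2, where the module is named `HTMLParser`.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TableCounter(HTMLParser):
    """Counts <table> and <tr> start tags in the HTML it is fed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1
        elif tag == 'tr':
            self.rows += 1


def count_tables(html):
    """Return (number of tables, number of rows) found in an HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.rows


# Hypothetical stand-in for a downloaded page: one static table, plus an
# empty placeholder that a script would fill in inside a real browser.
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><td>Total Reserved Instances</td><td>Upfront Discount</td></tr>
    <tr><td>Less than $250,000</td><td>0%</td></tr>
  </table>
  <div id="pricing-placeholder"></div>
  <script>/* a real page would build more tables here at load time */</script>
</body></html>
"""

print(count_tables(SAMPLE_HTML))
```

If the count is much lower than the number of tables you see in the browser, the missing ones are most likely being generated by JavaScript.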

To run the JavaScript you'll need a browser automation tool such as Selenium, or an engine such as WebKit or SpiderMonkey.

Here is a solution using Selenium:

import selenium.webdriver as webdriver
import contextlib
import pandas as pd

@contextlib.contextmanager
def quitting(thing):
    # Quit the browser even if the body of the with-block raises.
    try:
        yield thing
    finally:
        thing.quit()

with quitting(webdriver.Firefox()) as driver:
    link = 'http://aws.amazon.com/ec2/pricing/' 
    driver.get(link)
    content = driver.page_source
    with open('/tmp/out.html', 'wb') as f:
        f.write(content.encode('utf-8'))
    data = pd.read_html(content)
    print len(data)

yields

238
unutbu
  • Thanks for pointing out the JavaScript. However, I have a problem with your code example; something is wrong with the profile: "WebDriverException: Message: 'Can\'t load the profile. Profile Dir: /tmp/user/1001/tmpToFbg5" – Wawrzek Nov 25 '14 at 09:52
  • It sounds like an [incompatibility between your versions](http://stackoverflow.com/q/20957968/190597) of Selenium and Firefox. – unutbu Nov 25 '14 at 12:57