68

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date" and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

Andrew
  • 3,901
  • 15
  • 50
  • 64

4 Answers4

86

You should use some HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print dict(zip(headers, values))

prints

{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}
Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
74

Hands down the easiest way to parse a HTML table is to use pandas.read_html() - it accepts both URLs and HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest

Only downside is that read_html() doesn't preserve hyperlinks.

klimenkov
  • 347
  • 2
  • 8
zelusp
  • 3,500
  • 3
  • 31
  • 65
35

Sven Marnach excellent solution is directly translatable into ElementTree which is part of recent Python distributions:

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

same output as Sven Marnach's answer...

Hugo
  • 27,885
  • 8
  • 82
  • 98
  • 1
    +1 because it allows using cElementTree instead of ElementTree which is considerably faster than lxml if large number of tables are involved – Cerno Apr 06 '16 at 16:00
  • 1
    I have a web page saved from wikipedia. How can I specify to ET which table to parse and fetch data ? Is it possible by table name or table id ? – Massimo May 01 '17 at 14:31
  • 1
    also, and don't work. see https://stackoverflow.com/q/49286753/8929814 – CopyPasteIt Mar 14 '18 at 21:10
21

If the HTML is not XML you can't do it with etree. But even then, you don't have to use an external library for parsing a HTML table. In python 3 you can reach your goal with HTMLParser from html.parser. I've the code of the simple derived HTMLParser class here in a github repo.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D-lists representing tables. It looks maybe like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
schmijos
  • 8,114
  • 3
  • 50
  • 58