Parse HTML table to Python list?

Question

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date" and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

Sven Marnach · Accepted Answer · 2017-02-22T19:14:44.690

86

You should use some HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints

{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}
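
The question asks for a list of dictionaries rather than printed rows, so the per-row dictionaries can be collected into a list. A minimal, parser-independent sketch, assuming the cell text has already been extracted into plain lists (for example with the list comprehensions above). `rows_to_dicts` is a hypothetical helper name, not part of lxml, and padding short rows with None is one possible choice, not the only one:

```python
from itertools import zip_longest


def rows_to_dicts(headers, rows):
    """Pair each row's cell values with the header names.

    Rows shorter than the header are padded with None; a plain zip()
    would silently drop the missing keys instead.
    Cells beyond the last header are dropped.
    """
    result = []
    for values in rows:
        padded = zip_longest(headers, values[:len(headers)], fillvalue=None)
        result.append(dict(padded))
    return result


headers = ["Event", "Start Date", "End Date"]
rows = [["a", "b", "c"], ["d", "e"]]  # second row is missing a cell
events = rows_to_dicts(headers, rows)
print(events)
```

The result is a list with one dictionary per data row, which is the shape described in the question.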

edited Feb 22 '17 at 19:14

answered Jun 12 '11 at 22:59

Sven Marnach


My table has a varying number of rows. How can I make it work if this is the case? Thanks for the response, btw. – Andrew Jun 12 '11 at 23:09
@Andrew: The above code works for any number of rows and any number of columns, as long as every row has the same number of columns. – Sven Marnach Jun 12 '11 at 23:44
I'd suggest `HTMLParser`/`html.parser`, but this solution is much better in this case. – Jasmijn Jun 13 '11 at 09:25
This was a useful pointer for additional research. I actually have some broken HTML to parse, so some other answers involving lxml.html also proved useful. – Rob Fagen Jun 03 '14 at 21:50
it fails if html contains unquoted attrs – Maxdestroyer Feb 22 '17 at 18:50
@Maxdestroyer Looking at this again, you should probably use `etree.HTML`, not `etree.XML`, to get a more relaxed syntax. – Sven Marnach Feb 22 '17 at 19:11
@Maxdestroyer Edited my answer. – Sven Marnach Feb 22 '17 at 19:15
also, and don't work. see https://stackoverflow.com/q/49286753/8929814 – CopyPasteIt Mar 14 '18 at 21:10

score 74 · Answer 2 · edited Jun 14 '23 at 01:06

Hands down the easiest way to parse an HTML table is to use pandas.read_html(), which accepts both URLs and HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest

The only downside is that read_html() doesn't preserve hyperlinks by default; since pandas 1.5 the extract_links parameter can be used to keep them.
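
To get the list of dictionaries the question asks for, a DataFrame can be converted with `to_dict("records")`. A minimal sketch, assuming pandas and one of its HTML parser backends (lxml, or BeautifulSoup with html5lib) are installed. The sample table and its `id="events"` attribute are made up for illustration. Recent pandas versions expect literal HTML to be wrapped in `StringIO`, and `attrs` restricts the result to tables with matching attributes:

```python
from io import StringIO

import pandas as pd

html = """<table id="events">
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

# attrs keeps only tables whose HTML attributes match
tables = pd.read_html(StringIO(html), attrs={"id": "events"})

# One dictionary per row, keyed by the column headers
records = tables[0].to_dict("records")
print(records)
```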

edited Jun 14 '23 at 01:06

klimenkov


answered Jul 14 '17 at 23:48

zelusp


Awesome! Thanks. – Raf Aug 29 '17 at 13:45
Thanks! This was very quick and easy. – Shmuel Kamensky Jan 03 '18 at 05:54
Not a good way for tables containing `rowspan` and `colspan`! – John Strood Aug 08 '18 at 10:08
@JohnStrood Looking forward to reading your answer on how to handle `rowspan` and `colspan` – tommy.carstensen Aug 08 '18 at 23:19
@tommy.carstensen Ah! I used `bs4` to build an element tree, and traversed through the elements to break row-spanned column-spanned cells into constituent cells. – John Strood Aug 09 '18 at 06:48
@tommy.carstensen There are already answers here: https://stackoverflow.com/a/39336433/5337834 and https://stackoverflow.com/a/9980393/5337834. If you're still unsatisfied, I'll write my own answer! – John Strood Aug 09 '18 at 06:59
@zelusp I just learned, that Pandas is *extremely* slow, if your html has 100+ tables and you just want a single table with a specific `id`. Beautifulsoup is much faster in this case. – tommy.carstensen Jan 08 '20 at 00:15
This method is really simple and also works for small table findings, loved it! Thanks. – hp77 Jan 05 '22 at 16:40

score 35 · Answer 3 · edited Oct 03 '20 at 08:43

Sven Marnach's excellent solution translates directly to ElementTree, which is part of the Python standard library (the input must be well-formed XML):

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

same output as Sven Marnach's answer...
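
When a page contains several tables, ElementTree's limited XPath support can pick one by attribute, and the rows can be collected into a list instead of printed. A minimal sketch, assuming the page is well-formed XML; the `id` values and the helper name `table_to_dicts` are made up for illustration:

```python
from xml.etree import ElementTree as ET

page = """<html><body>
<table id="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
<table id="events">
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>
</body></html>"""


def table_to_dicts(table):
    """Turn a <table> element into a list of dicts keyed by the header row."""
    # iter("tr") also finds rows nested inside <thead>/<tbody>
    rows = table.iter("tr")
    headers = [cell.text for cell in next(rows)]
    return [dict(zip(headers, (cell.text for cell in row))) for row in rows]


root = ET.fromstring(page)
# Attribute predicates such as [@id='...'] are supported by ElementTree
table = root.find(".//table[@id='events']")
events = table_to_dicts(table)
print(events)
```

Note that `iter("tr")` would also pick up rows of nested tables, so this sketch only suits flat tables.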

edited Oct 03 '20 at 08:43

Hugo


answered Sep 06 '11 at 06:46

+1 because it allows using cElementTree instead of ElementTree which is considerably faster than lxml if large number of tables are involved – Cerno Apr 06 '16 at 16:00
I have a web page saved from wikipedia. How can I specify to ET which table to parse and fetch data ? Is it possible by table name or table id ? – Massimo May 01 '17 at 14:31
also, and don't work. see https://stackoverflow.com/q/49286753/8929814 – CopyPasteIt Mar 14 '18 at 21:10

score 21 · Answer 4 · answered Mar 11 '14 at 08:31

If the HTML is not well-formed XML, you can't parse it as XML with etree. Even so, you don't need an external library to parse an HTML table: in Python 3 you can use HTMLParser from html.parser. I've put the code for a simple HTMLParser subclass in a GitHub repo.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output is a list of 2D lists, one per table. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
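
The repository code is not reproduced here, so the following is only a rough sketch of what such an HTMLParser subclass could look like, not the author's actual HTMLTableParser. The class name `SimpleTableParser` is hypothetical, and the sketch assumes rows and cells are explicitly closed and ignores nested tables and colspan/rowspan:

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect every table in a document as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one 2D list per <table> seen so far
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


# html.parser tolerates unquoted attribute values such as border=1
html = """<table border=1>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>"""

parser = SimpleTableParser()
parser.feed(html)

# Treat the first row as the header to get a list of dictionaries
header, *body = parser.tables[0]
events = [dict(zip(header, row)) for row in body]
print(events)
```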

neat indeed. It will break if some td have a colspan though – mr.bjerre Sep 03 '21 at 09:03
