
I have an HTML document, and I want to pull the tables out of it and return them as arrays. I'm picturing two functions: one that finds all the HTML tables in a document, and a second that turns an HTML table into a 2-dimensional array.

Something like this:

htmltables = get_tables(htmldocument)
for table in htmltables:
    array=make_array(table)

There are two catches: 1. The number of tables varies from day to day. 2. The tables have all kinds of weird extra formatting, like bold and blink tags, randomly thrown in.

Thanks!

Zach

3 Answers


Use BeautifulSoup (I recommend 3.0.8). Finding all tables is trivial:

import BeautifulSoup

def get_tables(htmldoc):
    soup = BeautifulSoup.BeautifulSoup(htmldoc)
    return soup.findAll('table')

However, in Python, an array (in the sense of the standard library's array module) is 1-dimensional and constrained to pretty elementary item types (integers, floats, and the like). So there's no way to squeeze an HTML table into a Python array.
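For example, a tiny illustrative snippet of what the array module enforces:

from array import array

a = array('d', [1.0, 2.0, 3.5])  # typecode 'd': every item must be a float
try:
  a.append('text')               # non-numeric items are rejected
except TypeError as e:
  print(e)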

Maybe you mean a Python list instead? That's also 1-dimensional, but anything can be an item, so you could have a list of lists (one sublist per tr tag, I imagine, containing one item per td tag).

That would give:

def makelist(table):
  result = []
  allrows = table.findAll('tr')
  for row in allrows:
    result.append([])
    allcols = row.findAll('td')
    for col in allcols:
      thestrings = [unicode(s) for s in col.findAll(text=True)]
      thetext = ''.join(thestrings)
      result[-1].append(thetext)
  return result

This may not yet be quite what you want (doesn't skip HTML comments, the items of the sublists are unicode strings and not byte strings, etc) but it should be easy to adjust.
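For instance, here is a lightly adjusted variant along those lines (a sketch, not Alex's code: it filters out comment nodes, encodes cells to UTF-8 byte strings, and also picks up `th` header cells, which the comments below get into):

import re
import BeautifulSoup

def makelist_adjusted(table):
  result = []
  # match both td and th, so header rows are kept as well
  cell_tags = re.compile('^t[dh]$')
  for row in table.findAll('tr'):
    cells = []
    for col in row.findAll(cell_tags):
      # text=True yields NavigableStrings; Comment is a subclass of
      # NavigableString, so filter comments out before joining
      strings = [unicode(s) for s in col.findAll(text=True)
                 if not isinstance(s, BeautifulSoup.Comment)]
      cells.append(''.join(strings).encode('utf-8'))  # byte strings
    result.append(cells)
  return result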

Alex Martelli
    Beautiful soup is great and easy! Also try using lxml+xpath if looking for more speed. – Jon W May 20 '10 at 02:41
  • @user, always glad to help. If it's so good an answer to your question, you should "accept" it (by clicking the checkmark-shaped icon below the number of votes on the answer's upper left) -- that's a key part of SO's etiquette!-) – Alex Martelli May 20 '10 at 04:05
  • One more question: what if the table has a header row? – Zach May 20 '10 at 04:08
  • That would have `th` items rather than `td`, so the corresponding sublist in `result` would be empty -- you could just add `if not result[-1]: del result[-1]` after the `for col` loop to remove such empty rows, for example. – Alex Martelli May 20 '10 at 04:15
  • what if I'd like to include those header rows in the list? – Zach May 20 '10 at 04:31
  • Then you'll need to look for `th` as well as `td`. – Alex Martelli May 20 '10 at 04:35
  • Here's what I'm using, to include the header rows: allcols = row.findAll(re.compile('(td)|(th)')) – Zach May 20 '10 at 19:02
  • @user, yep, good idea, "finding" a RE rather than just a string is quite a good way to "look for th as well as td" as I recommended. – Alex Martelli May 20 '10 at 22:45
  • Pandas is the better option since it handles headers, colspan, and rowspan natively. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html "This function attempts to properly handle colspan and rowspan attributes. If the function has a `<thead>` argument, it is used to construct the header, otherwise the function attempts to find the header within the body (by putting rows with only `<th>` elements into the header)." – user12989841 Mar 06 '22 at 01:31

Pandas can extract all of the tables in your HTML into a list of DataFrames right out of the box, saving you from having to parse the page yourself (reinventing the wheel). A DataFrame is a powerful type of 2-dimensional array.

I recommend continuing to work with the data via Pandas since it's a great tool, but you can also convert to other formats if you prefer (list, dictionary, csv file, etc.).

Example

"""Extract all tables from an html file, printing and saving each to csv file."""

import pandas as pd

df_list = pd.read_html('my_file.html')

for i, df in enumerate(df_list):
    print(df)
    df.to_csv('table {}.csv'.format(i))

Getting the html content directly from the web instead of from a file would only require a slight modification:

import requests

html = requests.get('my_url').content
df_list = pd.read_html(html)
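
And if you do want the plain structures mentioned earlier, converting a DataFrame is a one-liner each (a minimal sketch; 'my_file.html' is a placeholder name):

import pandas as pd

df = pd.read_html('my_file.html')[0]  # first table on the page

as_list = df.values.tolist()       # plain list of row-lists
as_dict = df.to_dict('records')    # one dict per row, keyed by column
df.to_csv('table.csv', index=False)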
MarredCheese

A +1 to the question-asker and another to the god of Python.
I wanted to try this example using lxml and CSS selectors.
Yes, this is mostly the same as Alex's example:

import lxml.html
from pprint import pprint
markup = lxml.html.fromstring('''<html><body>\
<table width="600">
    <tr>
        <td width="50%">0,0,0</td>
        <td width="50%">0,0,1</td>
    </tr>
    <tr>
        <td>0,1,0</td>
        <td>0,1,1</td>
    </tr>
</table>
<table>
    <tr>
        <td>1,0,0</td>
        <td>1,<blink>0,</blink>1</td>
        <td>1,0,2</td>
        <td><bold>1</bold>,0,3</td>
    </tr>
</table>
</body></html>''')

tbl = []
rows = markup.cssselect("tr")
for row in rows:
  tbl.append(list())
  for td in row.cssselect("td"):
    tbl[-1].append(unicode(td.text_content()))

pprint(tbl)
#[[u'0,0,0', u'0,0,1'],
# [u'0,1,0', u'0,1,1'],
# [u'1,0,0', u'1,0,1', u'1,0,2', u'1,0,3']]
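
If you also want header cells (the `th` question from the comments on Alex's answer), the same approach works with a grouped CSS selector; a small variation on the loop above:

tbl = []
for row in markup.cssselect("tr"):
  # "td, th" matches data cells and header cells alike
  tbl.append([unicode(cell.text_content()) for cell in row.cssselect("td, th")])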
mechanical_meat