
I'm trying to parse tables from lots of HTML pages. Each target table has the following structure:

<table width="100%%" border="2" bordercolor="navy">
  <tr bordercolor="#0000FF">
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field1</b></font></td>
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field2</b></font></td>
     <td width="60%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field3</b></font></td>
  </tr>
    <tr>
    <td width="12%">A1</td>
    <td width="32%"><a href="../">A2</a></td>
    <td width="56%">A3</td>
  </tr>
  <tr>
    <td width="12%">B1</td>
    <td width="32%"><a href="../">B2</a></td>
    <td width="56%">B3
</td>
  </tr>
  <tr>
    <td width="12%">C1</td>
    <td width="32%"><a href="../">C2</a></td>
    <td width="56%">C3</td>
  </tr>
  <tr>
    <td width="12%">D1</td>
    <td width="32%"><a href="../">D2</a></td>
    <td width="56%">D3</td>
  </tr>

</table>

The number of rows varies from page to page, so the parser should be able to handle any number of rows. I would like to collect info from each HTML page like this:

A1 A2 A3
B1 B2 B3
C1 C2 C3
D1 D2 D3

How can I do that?

tima

1 Answer


You can use find_all() and get_text() to gather the table data. The find_all() method returns a list of all the matching descendants of a tag, and get_text() returns a string with a tag's text content. First select all tables, for each table select all rows, for each row select all cells, and finally extract the text. That collects all the table data in the same order and structure as it appears in the HTML document.

from bs4 import BeautifulSoup

html = 'my html document'  # the HTML source of the page
soup = BeautifulSoup(html, 'html.parser')
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]

The tables variable contains all the tables in the document as a nested list with the following structure:

tables -> rows -> columns
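
For the sample table above, you could then print each data row (skipping the header row with Field1/Field2/Field3) to get the desired output; a minimal sketch, assuming tables was built with the code above:

for table in tables:
    for row in table[1:]:        # table[0] is the header row
        print(' '.join(row))     # e.g. "A1 A2 A3"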

If the structure is not important and you only want to collect text from all tables in one big list, use:

table_data = [i.text for i in soup.find_all('td')]

Or if you prefer CSS selectors:

table_data = [i.text for i in soup.select('td')]
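
For the sample table above, either of these produces one flat list that includes the header cells, roughly:

['Field1', 'Field2', 'Field3', 'A1', 'A2', 'A3', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3']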

If the goal is to gather table data regardless of HTML attributes or other parameters, then it may be best to use pandas. The pandas.read_html() method reads HTML from URLs, files or strings, parses it and returns a list of dataframes that contain the table data.

import pandas as pd

html = 'my html document'
tables = pd.read_html(html)

Note that pandas.read_html() is more fragile than BeautifulSoup and will raise a ValueError if it fails to parse the HTML or if the document doesn't contain any tables.
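
If that is a concern, you can guard the call; a minimal sketch, assuming html again holds the page source:

import pandas as pd

html = 'my html document'

try:
    tables = pd.read_html(html)
    first_table = tables[0]   # DataFrame for the first table on the page
except ValueError:
    # raised when the HTML can't be parsed or contains no tables
    tables = []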

t.m.adam
  • I faced some problems with empty cells in tables. Using `td.text for td in tr.find_all('td')` instead of `td.string.strip()` helped me with that issue. – tima Aug 25 '17 at 11:04
  • I used `strip()` because it removes trailing spaces, tabs, etc., and produces clean text. Also I didn't collect columns from the 1st row (`[1:]`), as it seems to be a heading. Of course my code is a generic example based on the HTML in your post; you can modify it to fit your needs. – t.m.adam Aug 25 '17 at 11:33
  • 1
    I think that `td.text` or `td.get_text()` is a better way of retrieving the text content in the table. For differences between `.text` and `.string`, please refer to https://stackoverflow.com/questions/25327693/difference-between-string-and-text-beautifulsoup – wei ren Feb 13 '19 at 07:05
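
To illustrate the difference mentioned in the last comment: .string returns None when a tag has more than one child, while .text / .get_text() concatenates all the text in the tag's subtree. A minimal sketch:

from bs4 import BeautifulSoup

cell = BeautifulSoup('<td><a href="../">A2</a> extra</td>', 'html.parser').td
print(cell.string)      # None, because <td> has two children (<a> and ' extra')
print(cell.get_text())  # 'A2 extra', all text in the subtree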