
I have an HTML file with several tables in it, and I am reading the file like this:

tables = pd.read_html(filename, decimal=',', thousands=None, header=0)

However, pandas applies the header of the first table to all of the other tables. Is there any way to make pandas collect the header of each table separately?

Anna-Lischen

2 Answers


Got it. Thanks to that genius reply, I finally got it working:

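import pandas as pd

# No header argument here; each table's first row is promoted by hand below.
tables = pd.read_html(filename, decimal=',', thousands=None)
for i, t in enumerate(tables):
    header = t.iloc[0]   # first row of this table holds its column names
    print(header)
    t = t[1:]            # drop that row from the data
    t.columns = header
    tables[i] = t        # store the result; rebinding t alone leaves the list unchanged
    print(t)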

So I have assigned different header-values to each of the tables.
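For reuse, the same idea can be wrapped in a small function that returns new frames instead of modifying them inside the loop. This is only a sketch: `promote_first_row` and the sample frames are hypothetical names and data, standing in for whatever `pd.read_html` returns when no header row is detected.

```python
import pandas as pd


def promote_first_row(df):
    """Return a new frame whose first row has become the column labels."""
    promoted = df.iloc[1:].reset_index(drop=True)  # data rows, renumbered from 0
    promoted.columns = df.iloc[0].tolist()         # first row becomes the header
    return promoted


# Hypothetical stand-ins for tables parsed without a header.
raw_tables = [
    pd.DataFrame([["Title 1", "Year 1"], ["C", "1972"], ["Ruby", "1995"]]),
    pd.DataFrame([["Title 2", "Year 2"], ["Python", "1889"]]),
]

fixed = [promote_first_row(t) for t in raw_tables]
for t in fixed:
    print(list(t.columns))
```

Note that the data cells stay as whatever type they had before the promotion, so a numeric column that shared a column with header text may still need an explicit conversion afterwards.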

Anna-Lischen

Pandas normally reads each table into its own DataFrame correctly. I never use header=0, but in the example shown below it works fine with or without it.

What does your HTML file look like? Maybe you have to clean your HTML string first. Could you share it?
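If the file cannot be shared, one way to check its structure yourself is to count the header markup in each table. The sketch below uses only the standard library. `TableHeaderProbe` and the sample string are hypothetical, and it assumes the tables are not nested, since cells are simply credited to the most recently opened table. The idea is that a table without any `<thead>` or `<th>` gives pandas nothing to recognise as a header.

```python
from html.parser import HTMLParser


class TableHeaderProbe(HTMLParser):
    """Record, per <table>, whether it has a <thead> and how many <th>/<td> cells."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one dict per <table>, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> in old files is handled too.
        if tag == "table":
            self.tables.append({"thead": False, "th": 0, "td": 0})
        elif self.tables and tag == "thead":
            self.tables[-1]["thead"] = True
        elif self.tables and tag in ("th", "td"):
            self.tables[-1][tag] += 1


# Hypothetical sample: the first table marks its header with <th>, the second does not.
sample = """
<table><tr><th>Name</th><th>Year</th></tr><tr><td>C</td><td>1972</td></tr></table>
<table><tr><td>Name</td><td>Year</td></tr><tr><td>Ruby</td><td>1995</td></tr></table>
"""

probe = TableHeaderProbe()
probe.feed(sample)
for i, info in enumerate(probe.tables):
    print(i, info)
```

A table that reports zero `th` cells and no `thead` is a likely candidate for promoting its first row to a header by hand.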

The example below works well:

import pandas as pd


html_tables = """
<html>
<head>
<title>Data Tables</title>
</head>
<body>
Table1
<table>
  <thead>
    <tr>
      <th>Title 1</th>
      <th>Number 1</th> 
      <th>Year 1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C</td>
      <td>122,4</td> 
      <td>1972</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>2,44</td> 
      <td>1989</td>
    </tr>
    <tr>
      <td>Ruby</td>
      <td>44,55</td> 
      <td>1995</td>
    </tr>
  </tbody>
</table>
Some text in between<br>
Table2
<table>
  <thead>
    <tr>
      <th>Title 2</th>
      <th>Number 2</th> 
      <th>Year 2</th>
    </tr>
  </thead>
  <tbody>
     <tr>
      <td>C</td>
      <td>111,4</td> 
      <td>1872</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>4,55</td> 
      <td>1889</td>
    </tr>
    <tr>
      <td>Ruby</td>
      <td>66,55</td> 
      <td>1895</td>
    </tr>
  </tbody>
</table>
Text after
</body>
</html>
"""

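# Recent pandas versions deprecate passing literal HTML, so wrap the string in StringIO.
from io import StringIO
dfs = pd.read_html(StringIO(html_tables), decimal=',', thousands=None)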

print("First Dataframe")
print("###################")
print(dfs[0])
print("###################")
print("Second Dataframe")
print("###################")
print(dfs[1])



First Dataframe
###################
  Title 1  Number 1  Year 1
0       C    122.40    1972
1  Python      2.44    1989
2    Ruby     44.55    1995
###################
Second Dataframe
###################
  Title 2  Number 2  Year 2
0       C    111.40    1872
1  Python      4.55    1889
2    Ruby     66.55    1895

Alex Ortner
Well, you see, the file is really old, from about 2004, and it's rather poorly written, even by the HTML standards of 2004. I can't share the HTML with you right now, as I am in the middle of development. However, I will try to modify the file tonight and add it here, just for the sake of experience. – Anna-Lischen Sep 09 '19 at 11:55