2

Headings that appear partway through the table are being merged into the column headings instead of staying as rows.

import pandas as pd

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."
dfs = pd.read_html(url)
df = dfs[0]
df.head()

[screenshot of the output]

It would be great if I could have "High preferred use" as a column that is assigned to the correct species. I tried reset_index(), but that did not work, and I can't find anything similar when searching.

Response to @Master Oogway and thanks @DYZ for the edits.

There are multiple "table-striped" tables.

Screenshot of Inspect Element - multiple class="table-striped"

The suggested amendment removes the error, but does not pick up the second table. Take White Box, Eucalyptus albens: it occurs in the second table and not the first. If I export dftable and filter, there is no White Box:

Filter no White Box

If I write htmltable to .txt when using find_all and search, it's there:

[screenshot showing White Box present in the exported text]

I have never done this before and appreciate that this is annoying. Thanks for the help so far.

It appears that find_all is gathering all the table data, but the creation of dftable is limited to the first "table-striped" table.
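If that is what is happening, a rough sketch of one way around it, assuming the page's tables all share the class `table-striped`, is to loop over every match from `find_all` and concatenate the per-table frames. The inline HTML below is a small stand-in for the real page, so the species values are illustrative only:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the fetched page: two tables sharing class "table-striped".
html = """
<table class="table-striped">
  <tr><th>Common name</th><th>Species name</th></tr>
  <tr><td>Grey gum</td><td>Eucalyptus biturbinata</td></tr>
</table>
<table class="table-striped">
  <tr><th>Common name</th><th>Species name</th></tr>
  <tr><td>White Box</td><td>Eucalyptus albens</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

frames = []
# find_all returns every matching table, not just the first one
for table in soup.find_all("table", class_="table-striped"):
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")]
    # first row is the header, the rest are data rows
    frames.append(pd.DataFrame(rows[1:], columns=rows[0]))

dftable = pd.concat(frames, ignore_index=True)
```

With this, White Box from the second table ends up in dftable alongside the rows of the first.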

GWAE

3 Answers

3

The table cannot be easily parsed with read_html because of its unorthodox use of the <thead> tag. You can try your luck with BeautifulSoup:

import bs4
import urllib.request
import pandas as pd

soup = bs4.BeautifulSoup(urllib.request.urlopen(url), 'html.parser')
data = [["".join(cell.strings).strip()
         for cell in row.find_all(['td', 'th'])]
        for row in soup.find_all('table')[0].find_all('tr')]
table = pd.DataFrame(data[1:])\
          .rename(columns=dict(enumerate(data[0])))\
          .dropna(how='all')
DYZ
  • Agree, `pd.read_html()` is good for a first and easiest pass, but it won't necessarily handle every messy table in real life ;) Little note: *in newer code avoid the old syntax `findAll()`; instead use `find_all()` or `select()` with CSS selectors - for more, take a minute to [check the docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names)* – HedgeHog Nov 28 '22 at 06:59
  • This was a great solution. But it merged the use type into the common name column. – GWAE Nov 28 '22 at 23:02
1

So I took a look at the link and the table you're trying to get.

The problem with the table in the link is that it contains multiple headers, so the .read_html(URL) function gets all of them and sets them as your header:

table HTML head content

So instead of using pandas to read the HTML, I used Beautiful Soup for what you're trying to accomplish.

With Beautiful Soup and urllib.request I got the HTML from the URL and extracted the table with the class name "table-striped":

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

# load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')

# get the table you're trying to get, based
# on html elements
htmltable = soup.find('table', {'class': 'table-striped'})

Then, using a function I found that makes a list out of tables extracted with Beautiful Soup, I modified it to get your values in a shape that is easy to load into a dataframe and easy to query, depending on what you want:

[{"Common name": value, "Species name": value, "type": value}, ..., {}]

def tableDataText(table):
    """Parses an html segment starting with a <table> tag, followed
    by multiple <tr> (table row) tags with inner <td> (table data)
    and <th> (table header) tags.
    Returns a list of dicts, one per data row.
    The <th> cells of the first row are taken as the column headers.
    """
    def rowgetDataText(tr, coltag='td'):  # td (data) or th (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]

    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')

    if headerow:  # if there is a header row, skip it below
        trs = trs[1:]

    last_head = None
    for tr in trs:  # for every table row

        # this part is modified:
        # we get the type of use from the second-level
        # <th> headers inside your url's table
        if rowgetDataText(tr, 'th'):
            last_head = rowgetDataText(tr, 'th')

        # add a dict to the list that contains
        # "Common name", "Species name" and "type" (use type)
        if rowgetDataText(tr, 'td'):
            row = rowgetDataText(tr, 'td')
            rows.append({headerow[0]: row[0],
                         headerow[1]: row[1],
                         'type': last_head[0] if last_head else None})

    return rows

Then, when we convert the results of that function (run on the table content we extracted with Beautiful Soup) into a DataFrame, we get this:

dataframe of table from html

Then you can easily reference the type of use and each value common/species name.
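For example, with the list-of-dicts shape above loaded into a DataFrame, selecting one use type is a single boolean filter (the rows below are illustrative stand-ins for the real scraped data):

```python
import pandas as pd

# A few rows shaped like tableDataText() output (values illustrative only).
rows = [
    {"Common name": "Grey gum",
     "Species name": "Eucalyptus biturbinata",
     "type": "High preferred use"},
    {"Common name": "White Box",
     "Species name": "Eucalyptus albens",
     "type": "Occasional use"},
]
dftable = pd.DataFrame(rows)

# select only the high-preferred-use species
high = dftable[dftable["type"] == "High preferred use"]
```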

Here is the full code:


import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

#load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')


#get the table you're trying to get based
#on html elements
htmltable = soup.find('table', { 'class' : 'table-striped' })


#modified function taken from: https://stackoverflow.com/a/58274853/6297478
#to fit your data shape in a way that 
#you can use. 
def tableDataText(table):
    """Parses an html segment starting with a <table> tag, followed
    by multiple <tr> (table row) tags with inner <td> (table data)
    and <th> (table header) tags.
    Returns a list of dicts, one per data row.
    The <th> cells of the first row are taken as the column headers.
    """
    def rowgetDataText(tr, coltag='td'):  # td (data) or th (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]

    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')

    if headerow:  # if there is a header row, skip it below
        trs = trs[1:]

    last_head = None
    for tr in trs:  # for every table row

        # this part is modified:
        # we get the type of use from the second-level
        # <th> headers inside your url's table
        if rowgetDataText(tr, 'th'):
            last_head = rowgetDataText(tr, 'th')

        # add a dict to the list that contains
        # "Common name", "Species name" and "type" (use type)
        if rowgetDataText(tr, 'td'):
            row = rowgetDataText(tr, 'td')
            rows.append({headerow[0]: row[0],
                         headerow[1]: row[1],
                         'type': last_head[0] if last_head else None})

    return rows

#we store our results from the function in list_table
list_table = tableDataText(htmltable)

#turn our table into a DataFrame
dftable = pd.DataFrame(list_table)
dftable

I left some comments for you in the code to help you out.

I hope this helps!

  • This produced exactly what I needed. I really appreciate all the responses to this question. I noticed the table was different to the examples I could find online; if anyone else ever faces that issue, this solution will be helpful to them also. Thank you @darkfenix54 and all for the responses. – GWAE Nov 28 '22 at 23:04
  • When I change htmltable from find to find_all and look at the output, it returns values for all the accordion tables on the page. However, applying the function returns an attribute error: AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()? Any tips on that one? – GWAE Nov 29 '22 at 04:27
  • @GWAE what’s the type of ResultSet when you’re passing it on to the function? Also what does it contain? It sounds like find_all returned a list of elements, and not a single element, so you would probably have to iterate through the list and in each iteration call .find() – Master Oogway Nov 29 '22 at 04:55
  • 1
    @GWAE chance: ```trs = table.find_all('tr')``` to ```trs = table[0].find_all('tr')``` , since as stated above, ResultSet returns the results as a list, since there is only 1 table with the class name " table-striped" in the URL (that I saw), it's only returning one element. – Master Oogway Nov 29 '22 at 05:40
  • If I change [0] to [1] I get the other table. I spent about 5 hours thinking about this yesterday; woke up this morning, looked at it, and was like, no way. Thanks again @Master Oogway. – GWAE Nov 30 '22 at 21:57
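Tying the comment thread above together: `find_all` returns a ResultSet, which behaves like a list of tags, so the fix is to index it (`table[0]`, `table[1]`) or simply loop over it and parse each table in turn. A minimal sketch of the looping approach, with inline HTML standing in for the page (in the real code each element would be passed to `tableDataText`):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the page: two "table-striped" tables (values illustrative).
html = """
<table class="table-striped">
  <tr><td>Grey gum</td><td>Eucalyptus biturbinata</td></tr>
</table>
<table class="table-striped">
  <tr><td>White Box</td><td>Eucalyptus albens</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table", {"class": "table-striped"})  # a ResultSet

all_rows = []
for htmltable in tables:  # each element is a single table tag
    for tr in htmltable.find_all("tr"):  # works here, not on the ResultSet
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            all_rows.append(cells)

dftable = pd.DataFrame(all_rows, columns=["Common name", "Species name"])
```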
0

Just in addition to @DYZ's approach, using CSS selectors, stripped_strings and find_previous(). This creates a list of dicts that is then transformed into a dataframe:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

data = []
soup = BeautifulSoup(requests.get(url).text)
for e in soup.select('table tbody tr'):
    data.append(
        dict(
            zip(
                soup.table.thead.stripped_strings,
                [e.find_previous('th').get_text(strip=True)]+list(e.stripped_strings)
            )
        )
    )

pd.DataFrame(data)
     Common name         Species name            High preferred use
0    High preferred use  Grey gum                Eucalyptus biturbinata
1    High preferred use  Large-fruited grey gum  Eucalyptus canaliculata
..   ...                 ...                     ...
107  Occasional use      Broad-leaved paperbark  Melaleuca quinquenervia
108  Occasional use      NaN                     NaN
HedgeHog