Scraping tables using Pandas read_html and identifying headers

Question

I am completely new to web scraping and would like to parse a specific table that occurs in the SEC filing DEF 14A of companies. I was able to get the right URL and pass it to panda. Note: Even though the desired table should occur in every DEF 14A, it's layout may differ from company to company. Right now I am struggling with formatting the dataframe. How do I manage to get the right header and join it into a single index(column)?

This is my code so far:

url_to_use: "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"

resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")

dfs = pd.read_html(resp.text, match="Salary")
pd.options.display.max_columns = None
df = dfs[0]

df.dropna(how="all", inplace = True)
df.dropna(axis = 1, how="all", inplace = True)

display(df)

Right now the output of my code looks like this: Dataframe output

Whereas the correct layout looks like this: Original format

Is there some way to identify those rows that belong to the header and combine them as the header?

RJ Adriaansen · Accepted Answer · 2021-08-31T11:16:26.060

The table html is rather messed up. The empty cells are actually in the source code. It would be easiest to do some post processing:

import pandas as pd
import requests

r = requests.get("https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm", headers={'User-agent': 'Mozilla/5.0'}).text
df = pd.read_html(r) #load with user agent to avoid 401 error

df = df[40] #get the right table from the list of dataframes
df = df[8:].rename(columns={i: ' '.join(df[i][:8].dropna()) for i in df.columns}) #generate column headers from the first 8 rows

df.dropna(how='all', axis=1, inplace=True) #remove empty columns and rows
df.dropna(how='all', axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

def sjoin(x): return ''.join(x[x.notnull()].astype(str))
df = df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1)) #concatenate columns with the same headers, taken from https://stackoverflow.com/a/24391268/11380795

Result

	All Other Compensation ($)(4)	Change in Pension Value and Nonqualified Deferred Compensation Earnings ($)	Name and Principal Position	Non-Equity Incentive Plan Compensation ($)	Salary ($)	Stock Awards ($)(1)	Total ($)	Year
0	8953	(3)	David M. Demshur President and Chief Executive Officer	766200(2)	504569	1088559	2368281	2006
1	8944	(3)	Richard L. Bergmark Executive Vice President, Chief Financial Officer and Treasurer	330800(2)	324569	799096	1463409	2006
2	8940	(3)	Monty L. Davis Chief Operating Officer and Senior Vice President	320800(2)	314569	559097	1203406	2006
3	8933	(3)	John D. Denson Vice President, General Counsel and Secretary	176250(2)	264569	363581	813333	2006

Thank you very much! For a newbie like me, your comments in the code helped a lot to understand whats going on. Is it somehow possible to get an automated identification process for the headers? E.g. in the filing of this company [link](https://www.sec.gov/Archives/edgar/data/0001000623/000110465907015873/a07-6100_1def14a.htm) ale the headers of the table are in it's first row. Here, just the first row should be considered as a heading and there is no need to join multiple rows. Is it possible to differentiate between those cases within the code and automatically choose the right header? — xxgaryxx, Aug 31 '21 at 12:20
Yes, you could for example do a lookup for the index number of the first row in the first column that contains 'Position' and add 1 to get the index nr: `df[df[0].str.contains('Position', na=False)].index[0] + 1`. Assign this to a value and replace the `8` in my answer with that value and you've got it automated. — RJ Adriaansen, Aug 31 '21 at 20:22
Thanks, this works just fine when the table has the same amount of rows in each column. Unfortunately, there are cases where some columns actually have more rows for the header than others. E.g.: https://www.sec.gov/Archives/edgar/data/1000230/000119312507035211/ddef14a.htm (Table is on p. 13). In such a case, some parts of the header go in the table: https://imgur.com/a/ZUOiUBg Is there some workaround for this problem? Maybe applying the following logic: Over all columns, consider everything above the first numeric entry in the "Year" column as header. — xxgaryxx, Sep 02 '21 at 13:56

Scraping tables using Pandas read_html and identifying headers

1 Answers1