0

I am completely new to web scraping and would like to parse a specific table that occurs in the SEC filing DEF 14A of companies. I was able to get the right URL and pass it to panda. Note: Even though the desired table should occur in every DEF 14A, it's layout may differ from company to company. Right now I am struggling with formatting the dataframe. How do I manage to get the right header and join it into a single index(column)?

This is my code so far:

url_to_use: "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"

resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")

dfs = pd.read_html(resp.text, match="Salary")
pd.options.display.max_columns = None
df = dfs[0]

df.dropna(how="all", inplace = True)
df.dropna(axis = 1, how="all", inplace = True)

display(df)

Right now the output of my code looks like this: Dataframe output

Whereas the correct layout looks like this: Original format

Is there some way to identify those rows that belong to the header and combine them as the header?

xxgaryxx
  • 335
  • 1
  • 7

1 Answers1

1

The table html is rather messed up. The empty cells are actually in the source code. It would be easiest to do some post processing:

import pandas as pd
import requests

r = requests.get("https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm", headers={'User-agent': 'Mozilla/5.0'}).text
df = pd.read_html(r) #load with user agent to avoid 401 error

df = df[40] #get the right table from the list of dataframes
df = df[8:].rename(columns={i: ' '.join(df[i][:8].dropna()) for i in df.columns}) #generate column headers from the first 8 rows

df.dropna(how='all', axis=1, inplace=True) #remove empty columns and rows
df.dropna(how='all', axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

def sjoin(x): return ''.join(x[x.notnull()].astype(str))
df = df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1)) #concatenate columns with the same headers, taken from https://stackoverflow.com/a/24391268/11380795

Result

All Other Compensation ($)(4) Change in Pension Value and Nonqualified Deferred Compensation Earnings ($) Name and Principal Position Non-Equity Incentive Plan Compensation ($) Salary ($) Stock Awards ($)(1) Total ($) Year
0 8953 (3) David M. Demshur President and Chief Executive Officer 766200(2) 504569 1088559 2368281 2006
1 8944 (3) Richard L. Bergmark Executive Vice President, Chief Financial Officer and Treasurer 330800(2) 324569 799096 1463409 2006
2 8940 (3) Monty L. Davis Chief Operating Officer and Senior Vice President 320800(2) 314569 559097 1203406 2006
3 8933 (3) John D. Denson Vice President, General Counsel and Secretary 176250(2) 264569 363581 813333 2006
RJ Adriaansen
  • 9,131
  • 2
  • 12
  • 26
  • Thank you very much! For a newbie like me, your comments in the code helped a lot to understand whats going on. Is it somehow possible to get an automated identification process for the headers? E.g. in the filing of this company [link](https://www.sec.gov/Archives/edgar/data/0001000623/000110465907015873/a07-6100_1def14a.htm) ale the headers of the table are in it's first row. Here, just the first row should be considered as a heading and there is no need to join multiple rows. Is it possible to differentiate between those cases within the code and automatically choose the right header? – xxgaryxx Aug 31 '21 at 12:20
  • Yes, you could for example do a lookup for the index number of the first row in the first column that contains 'Position' and add 1 to get the index nr: `df[df[0].str.contains('Position', na=False)].index[0] + 1`. Assign this to a value and replace the `8` in my answer with that value and you've got it automated. – RJ Adriaansen Aug 31 '21 at 20:22
  • Thanks, this works just fine when the table has the same amount of rows in each column. Unfortunately, there are cases where some columns actually have more rows for the header than others. E.g.: https://www.sec.gov/Archives/edgar/data/1000230/000119312507035211/ddef14a.htm (Table is on p. 13). In such a case, some parts of the header go in the table: https://imgur.com/a/ZUOiUBg Is there some workaround for this problem? Maybe applying the following logic: Over all columns, consider everything above the first numeric entry in the "Year" column as header. – xxgaryxx Sep 02 '21 at 13:56