I am completely new to web scraping and would like to parse a specific table that appears in companies' SEC DEF 14A filings. I was able to get the right URL and pass it to pandas. Note: even though the desired table should appear in every DEF 14A, its layout may differ from company to company. Right now I am struggling with formatting the dataframe. How can I get the right header and join it into a single header row (one level of column labels)?
This is my code so far:
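from io import StringIO
import requests
import bs4 as bs
import pandas as pd
from IPython.display import display

url_to_use = "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"
resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")  # parsed, but not used below
# keep only the tables whose text contains "Salary"
dfs = pd.read_html(StringIO(resp.text), match="Salary")
pd.options.display.max_columns = None  # show all columns
df = dfs[0]
# drop rows and columns that are entirely empty
df.dropna(how="all", inplace=True)
df.dropna(axis=1, how="all", inplace=True)
display(df)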
Right now the output of my code looks like this: Dataframe output
Whereas the correct layout looks like this: Original format
Is there a way to identify the rows that belong to the header and combine them into a single header?
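To make the question concrete, here is a rough sketch of the kind of thing I have in mind. It is not tested against the real filing. The helper names `count_header_rows` and `flatten_header` are made up for illustration, the toy table is invented data that only mimics the shape of the output, and the heuristic (the header ends at the first row with a cell that looks like a plain number, such as a year or a dollar amount) is an assumption that may well fail for other companies' layouts.

```python
import pandas as pd

# Assumption: a "data" cell looks like a plain number, e.g. 2006 or $500,000.
NUMBER_LIKE = r"^\$?\s*[\d,]+(?:\.\d+)?$"


def count_header_rows(df: pd.DataFrame, max_rows: int = 5) -> int:
    """Guess how many leading rows are header text (hypothetical helper)."""
    for i in range(min(max_rows, len(df))):
        cells = df.iloc[i].dropna().astype(str).str.strip()
        # The first row containing a number-like cell is treated as data.
        if cells.str.match(NUMBER_LIKE).any():
            return i
    return 0


def flatten_header(df: pd.DataFrame, n_header: int) -> pd.DataFrame:
    """Join the first n_header rows, column by column, into one label row."""
    header = df.iloc[:n_header]
    labels = []
    for col in header.columns:
        # Skip empty cells and glue the remaining pieces with a space.
        parts = [str(v).strip() for v in header[col] if pd.notna(v)]
        labels.append(" ".join(parts))
    body = df.iloc[n_header:].reset_index(drop=True)
    body.columns = labels
    return body


# Invented toy input: the header is spread over the first two rows.
raw = pd.DataFrame(
    [
        ["Name and", None, "Salary", "Stock"],
        ["Principal Position", "Year", "($)", "Awards ($)(1)"],
        ["Jane Doe, CEO", "2006", "500,000", "1,200,000"],
        ["John Roe, CFO", "2006", "300,000", "400,000"],
    ]
)

tidy = flatten_header(raw, count_header_rows(raw))
print(tidy.columns.tolist())
```

On my real dataframe I would call it as `flatten_header(df, count_header_rows(df))` after the two `dropna` steps, but I am not sure the number-based heuristic is reliable enough across filings.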