I am using Python's BeautifulSoup to extract the risk factors section from 10-K filings, so I need to locate where the section begins and ends. For example:
https://www.sec.gov/Archives/edgar/data/1321502/000143774910004615/andain_10k-123106.htm
In that filing, I want to find this element:
ITEM 1A. RISK FACTORS
I would like to adapt something like this:
my_text = soup.find_all(string=["ITEM1A.RISKFACTORS", "RISKFACTORS"])
I would like it to also capture any string that exactly matches one of my strings ("ITEM1A.RISKFACTORS", "RISKFACTORS") once it has been stripped of leading and trailing spaces (and possibly the spaces between words too), non-breaking spaces, and any other formatting that gets in the way.
I want to match:
"Item 1A. Risk Factors"
"Risk Factors"
" Risk Factors"
"Risk Factors"
"Item 1A.\xa0Risk Factors"
I don't want to match:
"We have a number of Risk Factors."
Ideally I'd like to avoid regex. If there is some kind of "stripped=True" argument I can use with find_all, that would be better, but if regex is needed, please help me write it! Thanks
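To make the matching rule concrete, here is a minimal regex-free sketch of the comparison I have in mind. The names TARGETS, normalize and is_risk_heading are hypothetical, and it assumes that case should be ignored and that all whitespace (including non-breaking spaces) can simply be dropped before comparing:

```python
# Hypothetical helpers; only built-in str methods are used, no regex.
TARGETS = {"ITEM1A.RISKFACTORS", "RISKFACTORS"}


def normalize(text):
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space "\xa0". Joining the pieces
    # removes leading, trailing and in-between whitespace in one step.
    return "".join(text.split()).upper()


def is_risk_heading(text):
    # Guard against None, then require an exact match after normalizing
    # (assumption: case differences should not matter).
    return text is not None and normalize(text) in TARGETS


samples = [
    "Item 1A. Risk Factors",
    " Risk Factors",
    "Item 1A.\xa0Risk Factors",
    "We have a number of Risk Factors.",
]
for s in samples:
    print(repr(s), is_risk_heading(s))
```

Since find_all accepts a function for its string argument, something like soup.find_all(string=is_risk_heading) might be all that is needed, but I have not verified this against the filing above.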