
I am using Python and BeautifulSoup to extract the Risk Factors section from 10-K filings (so I need to locate where the section begins and ends). For example:

https://www.sec.gov/Archives/edgar/data/1321502/000143774910004615/andain_10k-123106.htm

To get this element:

ITEM 1A.  RISK FACTORS

I would like to adapt something like this:

my_text = soup.find_all(string=["ITEM1A.RISKFACTORS", "RISKFACTORS"])

I would also like it to capture any string that, once stripped of leading and trailing spaces (and possibly spaces between words too), non-breaking spaces, and any other formatting that gets in the way, exactly matches one of my strings ("ITEM1A.RISKFACTORS", "RISKFACTORS").

I want: "Item 1A. Risk Factors" "Risk Factors" " Risk Factors" "Risk Factors" "Item 1A.\xa0Risk Factors"

I don't want: "We have a number of Risk Factors."

Ideally I'd like to avoid regex. If there is some kind of "stripped=True" argument I can use with find_all, that would be better, but if regex is needed, please help me write it! Thanks

ks123321
  • Welcome to SO. Help us help you: please improve your question so that we can reproduce your issue. See [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Some written code, a URL, ... would be cool. Thanks. – HedgeHog Jan 20 '21 at 22:15
  • Thank you! I've edited my post. – ks123321 Jan 20 '21 at 22:23

1 Answer


In my opinion, the easiest way to do this is with regex. You can collapse repeated spaces between words like this:

>>> import re
>>> re.sub(' +', ' ', 'The     quick brown    fox')
'The quick brown fox'
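Note that `' +'` only matches ordinary spaces. As a rough sketch of how the same idea could cover the strings in the question, the snippet below removes every run of whitespace with `\s+` (which, for Python 3 `str` patterns, also matches the non-breaking space `\xa0`), upper-cases the result, and compares it with the two target strings. The names `TARGETS`, `squash`, and `is_risk_heading` are made up for this illustration, and the upper-casing is an assumption based on the mixed-case examples in the question.

```python
import re

# Target strings from the question, already free of whitespace.
TARGETS = {"ITEM1A.RISKFACTORS", "RISKFACTORS"}

def squash(text):
    """Remove all whitespace (including non-breaking spaces) and upper-case."""
    return re.sub(r"\s+", "", text).upper()

def is_risk_heading(text):
    """True only if the whole string, once squashed, equals a target."""
    return text is not None and squash(text) in TARGETS

print(is_risk_heading("Item 1A.\xa0Risk Factors"))           # True
print(is_risk_heading("We have a number of Risk Factors."))  # False
```

Since `find_all` accepts a function for its `string` argument, a predicate like this could be passed as `soup.find_all(string=is_risk_heading)`.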

To fix encoding you can try something like this:

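from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.elnorte.ec/')
x = BeautifulSoup(r.read(), 'html.parser')  # read() must be called to get the page bytes
r.close()

print(x.prettify('latin-1'))  # with an encoding, prettify() returns bytes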

To search for those strings I would also suggest using regex, which would look something along these lines:

print(soup.find_all(string=re.compile('exact text')))

Note that this won't only find "exact text" but also "almost exact text", because the pattern can match anywhere in the string.
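One possible way to make the match exact is to anchor the pattern with `^` and `$` and allow optional whitespace between the words. This is only a sketch: the pattern and the name `RISK_HEADING` are assumptions built from the examples in the question, not something tested against real filings.

```python
import re

# Anchored, case-insensitive pattern: optional "Item 1A." prefix, then
# "Risk Factors", with any amount of whitespace in between.
# For str patterns, \s also matches the non-breaking space.
RISK_HEADING = re.compile(
    r"^\s*(?:ITEM\s*1A\.\s*)?RISK\s*FACTORS\s*$",
    re.IGNORECASE,
)

samples = [
    "Item 1A. Risk Factors",
    " Risk Factors",
    "Item 1A.\xa0Risk Factors",
    "We have a number of Risk Factors.",
]
for s in samples:
    # search() is used here because BeautifulSoup applies compiled
    # patterns with search(), so the anchors do the exact matching.
    print(repr(s), bool(RISK_HEADING.search(s)))
```

A compiled pattern like this can be passed as `soup.find_all(string=RISK_HEADING)`; the anchors are what keep the sentence in the last sample from matching.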

Jakov Gl.