Finding

Question

I am trying to scrape MSFT's income statement using code I found here: How to Web scraping SEC Edgar 10-K Dynamic data

That answer searches for a 'span' tag to narrow the search. I do not see a span in this filing, so I am trying to search for the <p> tag instead, with no luck.

Here is my code, which is largely unchanged from the answer given: I changed base_url and changed the soup.find call to look for 'p'. Is there a way to find the <p> tag or, even better, a way to find the income statement table directly?

Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Here is the code from the example:

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Thank you!

There's a space after `INCOME STATEMENTS`. Add that to your string. — Barmar, Dec 16 '22 at 02:50
I did, thank you though, Barmar. I keep getting the error: 'NoneType' object has no attribute 'find_next'. Which makes me assume it's not finding the 'p' class. — Dan, Dec 16 '22 at 02:53
Try using a regexp instead. `string=re.compile('INCOME STATEMENTS')` — Barmar, Dec 16 '22 at 02:55
I tried "s = soup.find('p', recursive=True, string=re.compile('INCOME STATEMENTS '))". Got the same error. Sorry, Barmar, I am fairly new to coding. I appreciate you trying to help me! Any other suggestions? — Dan, Dec 16 '22 at 02:58
I don't think you need `recursive=True`, but I don't think it should make a difference. — Barmar, Dec 16 '22 at 03:00

Barmar · Accepted Answer · 2022-12-16T03:07:51.150


I'm not sure why that's not working, but you can try this:

s = soup.find('a', attrs={'name':'INCOME_STATEMENTS'})

This should match the <a name="INCOME_STATEMENTS"></a> element inside that paragraph.
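A likely reason the original `soup.find('p', string=...)` returns None: when a tag name is given, BeautifulSoup compares `string=` against the tag's `.string`, and `.string` is None for a tag with more than one child. Here the paragraph holds both the `<a>` anchor and the heading text. Below is a minimal sketch of that behavior and of the anchor lookup. The HTML fragment, the class name, and the table values are made-up placeholders that only imitate the filing's structure; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the filing's markup (placeholder values,
# not real figures from the 10-K).
html = """
<p class="heading"><a name="INCOME_STATEMENTS"></a>INCOME STATEMENTS </p>
<table>
  <tr><td>Revenue</td><td>100</td></tr>
  <tr><td>Net income</td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The <p> has two children (the <a> tag and a text node), so its .string
# is None and a string= filter on the <p> finds nothing.
by_string = soup.find("p", string="INCOME STATEMENTS ")

# Matching the named anchor does not depend on the paragraph's text.
anchor = soup.find("a", attrs={"name": "INCOME_STATEMENTS"})

# From the anchor, walk forward to the next table and collect its rows.
table = anchor.find_next("table")
rows = [list(tr.stripped_strings) for tr in table.find_all("tr") if tr.text]

print(by_string)
print(rows)
```

Against the real filing, the same `find` / `find_next('table')` pair should slot into the question's code in place of the `soup.find('p', ...)` line.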

edited Dec 16 '22 at 03:07

answered Dec 16 '22 at 03:03


Definitely getting close! I got this error: TypeError: find() got multiple values for argument 'name' – Dan Dec 16 '22 at 03:06
Fixed it, see https://stackoverflow.com/questions/2877114/parameters-for-find-function – Barmar Dec 16 '22 at 03:07
There shouldn't be a space at the end. – Barmar Dec 16 '22 at 03:13
You, Barmar, are a saint! Thank you so much! – Dan Dec 16 '22 at 03:13
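For anyone hitting the same TypeError from the comments: the first positional parameter of `find()` is itself called `name`, so passing the HTML attribute as a `name=` keyword collides with it. A small sketch of the collision and two ways around it, using a made-up one-line fragment rather than the real filing:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment standing in for the filing's heading.
soup = BeautifulSoup(
    '<p><a name="INCOME_STATEMENTS"></a>INCOME STATEMENTS</p>', "html.parser"
)

# 'a' already fills find()'s own `name` parameter, so a name= keyword
# raises TypeError instead of filtering on the HTML attribute.
try:
    soup.find("a", name="INCOME_STATEMENTS")
    error = None
except TypeError as exc:
    error = str(exc)

# Passing the attribute through attrs= avoids the clash.
via_attrs = soup.find("a", attrs={"name": "INCOME_STATEMENTS"})

# A CSS attribute selector is an alternative that sidesteps keywords entirely.
via_css = soup.select_one('a[name="INCOME_STATEMENTS"]')

print(error)
print(via_attrs, via_css)
```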
