I am trying to scrape MSFT's income statement using code I found here: How to Web scraping SEC Edgar 10-K Dynamic data

The answer there searches for a <span> tag to narrow the search. I do not see a <span> in my document, so I am trying to match the <p> tag instead, with no luck.

Here is my code; it is largely unchanged from the answer given. I changed base_url and changed the soup.find call to look for 'p'. Is there a way to find the <p> element or, even better, a way to find the income statement table directly?

Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for the 10-K filing page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Here is the code from the example:

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Thank you!

Dan
  • There's a space after `INCOME STATEMENTS`. Add that to your string. – Barmar Dec 16 '22 at 02:50
  • I did, thank you though, Barmar. I keep getting the error: 'NoneType' object has no attribute 'find_next'. Which makes me assume it's not finding the 'p' class. – Dan Dec 16 '22 at 02:53
  • Try using a regexp instead. `string=re.compile('INCOME STATEMENTS')` – Barmar Dec 16 '22 at 02:55
  • I tried "s = soup.find('p', recursive=True, string=re.compile('INCOME STATEMENTS '))". Got the same error. Sorry, Barmar, I am fairly new to coding. I appreciate you trying to help me! Any other suggestions? – Dan Dec 16 '22 at 02:58
  • I don't think you need `recursive=True`, but I don't think it should make a difference. – Barmar Dec 16 '22 at 03:00

1 Answer

I'm not sure why that's not working, but you can try this:

s = soup.find('a', attrs={'name':'INCOME_STATEMENTS'})

This should match the <a name="INCOME_STATEMENTS"></a> element inside that paragraph.
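
One plausible explanation for the original failure, assuming the paragraph really does contain both the empty anchor and the heading text: when `string=` is combined with a tag name, BeautifulSoup compares against the tag's `.string`, and `.string` is `None` for a tag with more than one child. The sketch below reproduces that on a small made-up HTML fragment (the markup, class name, and row values are illustrative, not copied from the filing) and shows the anchor lookup, plus a text-based fallback, working where the `string=` search does not.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the description above: a <p> holding an
# empty named anchor plus the heading text, followed by a table.
# The row values are placeholders, not real filing data.
html = """
<p class="heading"><a name="INCOME_STATEMENTS"></a>INCOME STATEMENTS </p>
<table>
  <tr><td>Row A</td><td>1</td></tr>
  <tr><td>Row B</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The <p> has two children (the <a> tag and a text node), so p.string is None
# and the string= filter has nothing to compare against.
by_string = soup.find("p", string="INCOME STATEMENTS ")

# Matching the anchor by its name attribute does not depend on .string.
anchor = soup.find("a", attrs={"name": "INCOME_STATEMENTS"})

# Fallback: match on the paragraph's full text instead of .string.
by_text = soup.find(
    lambda tag: tag.name == "p" and "INCOME STATEMENTS" in tag.get_text()
)

# Guard against None before chaining, which avoids the
# "'NoneType' object has no attribute 'find_next'" error.
rows = []
if anchor is not None:
    table = anchor.find_next("table")
    if table is not None:
        rows = [
            list(tr.stripped_strings)
            for tr in table.find_all("tr")
            if tr.text.strip()
        ]

for row in rows:
    print(row)
```

On the real filing, the text-based fallback may hit an earlier mention of the phrase (a table of contents entry, for instance), so the named anchor is likely the more reliable hook.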

Barmar