How do I scrape HTML without closing tags using Python?

Question

I got the following data from EDGAR:

<SERIES-AND-CLASSES-CONTRACTS-DATA>
<EXISTING-SERIES-AND-CLASSES-CONTRACTS>
<SERIES>
<OWNER-CIK>0000074663
<SERIES-ID>S000004984
<SERIES-NAME>Eaton Vance Income Fund of Boston
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013484
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class A
<CLASS-CONTRACT-TICKER-SYMBOL>EVIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013485
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class B
<CLASS-CONTRACT-TICKER-SYMBOL>EBIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013486
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class C
<CLASS-CONTRACT-TICKER-SYMBOL>ECIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013487
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class R
<CLASS-CONTRACT-TICKER-SYMBOL>ERIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013488
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class I
<CLASS-CONTRACT-TICKER-SYMBOL>EIBIX
</CLASS-CONTRACT>
</SERIES>
</EXISTING-SERIES-AND-CLASSES-CONTRACTS>
</SERIES-AND-CLASSES-CONTRACTS-DATA>

Ideally, I would like to scrape all the information for each tag and its subtags. It seems that the tags within a class contract (e.g., class-contract-id) do not have closing tags.

Possibly for this reason, I get the following result when I try this out:

from bs4 import BeautifulSoup

with open("temp.txt",'r') as html_file:
    content = html_file.read()
    soup = BeautifulSoup(content, 'lxml')
        
    series = soup.find('series') 
    
    for item in series:
        cik = item.find('owner-cik')
        print(cik)

Result:

-1
None

Is there any possible way to sort this out?

That's not HTML. It's not even XML (could be SGML), but you can use the techniques listed in the duplicate link to clean it up so it can be parsed as XML. — kjhughes, May 18 '22 at 01:06
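One way to do the kind of cleanup the comment describes is sketched below. It assumes that every value sits on the same line as its opening tag, as in the sample from the question, and appends the missing closing tag so the standard library's XML parser can read the result. The names `SAMPLE`, `VALUE_LINE`, and `close_value_tags` are made up for this illustration, and the sample is a shortened copy of the question's data.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened copy of the data in the question.
SAMPLE = """<SERIES>
<OWNER-CIK>0000074663
<SERIES-ID>S000004984
<SERIES-NAME>Eaton Vance Income Fund of Boston
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013484
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class A
<CLASS-CONTRACT-TICKER-SYMBOL>EVIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013485
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class B
<CLASS-CONTRACT-TICKER-SYMBOL>EBIBX
</CLASS-CONTRACT>
</SERIES>
"""

# Matches a line like "<OWNER-CIK>0000074663": an opening tag followed by a value.
# Container lines ("<SERIES>") and closing tags ("</SERIES>") do not match.
VALUE_LINE = re.compile(r"^<([A-Za-z][\w-]*)>(.+)$")


def close_value_tags(text):
    """Append the missing closing tag to every '<TAG>value' line."""
    fixed = []
    for line in text.splitlines():
        match = VALUE_LINE.match(line.strip())
        if match:
            name, value = match.groups()
            # Escape &, <, > in the value so the output is well-formed XML.
            line = f"<{name}>{escape(value)}</{name}>"
        fixed.append(line)
    return "\n".join(fixed)


root = ET.fromstring(close_value_tags(SAMPLE))
print(root.findtext("OWNER-CIK"))
for contract in root.iter("CLASS-CONTRACT"):
    print(contract.findtext("CLASS-CONTRACT-ID"),
          contract.findtext("CLASS-CONTRACT-TICKER-SYMBOL"))
```

If a value can span several lines, or a tag can appear with no value at all, the pattern would need to be adjusted.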

score 0 · Answer 1 · answered May 18 '22 at 01:20

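The issue is that iterating over series yields its direct children. The first child is the whitespace string after the opening tag; it behaves like a plain str, so item.find('owner-cik') is an ordinary substring search there and returns -1. The second child is the OWNER-CIK tag itself, so searching inside it for another owner-cik returns None. series.find('owner-cik') will probably do what you want, as page 33 of the specification seems to say there's only one OWNER-CIK per SERIES.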
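A minimal sketch of that suggestion, using a shortened copy of the data from the question. It uses Python's built-in "html.parser" so that it only depends on bs4; the question's 'lxml' parser can be swapped in. Because the value tags are never closed, the parser nests the following tags inside them, so the sketch reads only the first piece of text in a tag. `SAMPLE` and `first_text` are names invented for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the data in the question; value tags have no closing tags.
SAMPLE = """<SERIES>
<OWNER-CIK>0000074663
<SERIES-ID>S000004984
<SERIES-NAME>Eaton Vance Income Fund of Boston
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013484
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class A
<CLASS-CONTRACT-TICKER-SYMBOL>EVIBX
</CLASS-CONTRACT>
</SERIES>
"""


def first_text(tag):
    """Return the first non-blank string inside tag, ignoring what is nested after it."""
    return next(tag.stripped_strings)


soup = BeautifulSoup(SAMPLE, "html.parser")
series = soup.find("series")

# Search from the series tag itself instead of looping over its children.
owner_cik = first_text(series.find("owner-cik"))
tickers = [first_text(t) for t in series.find_all("class-contract-ticker-symbol")]

print(owner_cik)
print(tickers)
```

Calling get_text() on owner-cik instead would also return the text of every tag nested after it, which is why the helper stops at the first string.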

There also appear to be a number of existing Python libraries for downloading and parsing EDGAR data. You may be able to use or modify one of those instead.
