I have an SGML file that mixes tags that require a closing tag with tags that don't. BeautifulSoup can prettify this for HTML, but my tags are custom, and BeautifulSoup just closes them all at the end of the file. Here's the source:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1122304/000119312515118890/0001193125-15-118890.hdr.sgml'
sgml = requests.get(url).text
soup = BeautifulSoup(sgml, 'html5lib')
And here's the file:
<SEC-HEADER>0001193125-15-118890.hdr.sgml : 20150403
<ACCEPTANCE-DATETIME>20150403143902
<ACCESSION-NUMBER>0001193125-15-118890
<TYPE>DEF 14A
<PUBLIC-DOCUMENT-COUNT>37
<PERIOD>20150515
<FILING-DATE>20150403
<DATE-OF-FILING-DATE-CHANGE>20150403
<EFFECTIVENESS-DATE>20150403
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>AETNA INC /PA/
<CIK>0001122304
<ASSIGNED-SIC>6324
<IRS-NUMBER>232229683
<STATE-OF-INCORPORATION>PA
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
...
</SEC-HEADER>
Here FILER and COMPANY-DATA require closing tags and the others don't.
How can I tell BeautifulSoup's parser to close certain tags at the end of the line? Does it have something to do with how BS deals with br and li vs. a and div?
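To make the behavior I'm after concrete, here is a minimal sketch of a pre-processing workaround. It is not a BeautifulSoup feature. The helper name close_value_tags is hypothetical, and it assumes that every container tag (such as FILER and COMPANY-DATA) has an explicit closing tag somewhere in the full file, and that every other tag sits at the start of its own line with its value after it.

```python
import re

# Matches an opening tag at the start of a line, e.g. "<CIK>0001122304".
OPEN_TAG = re.compile(r'\s*<([A-Za-z0-9-]+)>')
# Matches any explicit closing tag, e.g. "</COMPANY-DATA>".
CLOSE_TAG = re.compile(r'</([A-Za-z0-9-]+)>')


def close_value_tags(sgml: str) -> str:
    """Append a closing tag to every line whose tag is never closed explicitly.

    Assumption: a tag that has an explicit closing tag anywhere in the text
    is a container and must be left alone.
    """
    containers = set(CLOSE_TAG.findall(sgml))
    fixed_lines = []
    for line in sgml.splitlines():
        match = OPEN_TAG.match(line)
        if match and match.group(1) not in containers:
            # Value tag: close it at the end of its own line.
            line = f'{line.rstrip()}</{match.group(1)}>'
        fixed_lines.append(line)
    return '\n'.join(fixed_lines)


sample = "<FILER>\n<COMPANY-DATA>\n<CIK>0001122304\n</COMPANY-DATA>\n</FILER>"
print(close_value_tags(sample))
```

The result could then be handed to BeautifulSoup as before. Note that on a truncated excerpt where a container's closing tag has been cut off, this sketch would wrongly treat that container as a value tag, so it needs the complete file.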