
I have an SGML file that mixes tags that require closing with tags that don't. BeautifulSoup can prettify this for HTML, but my tags are custom, and BeautifulSoup just closes them at the end of the file. Here's the source:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1122304/000119312515118890/0001193125-15-118890.hdr.sgml'
sgml = requests.get(url).text
soup = BeautifulSoup(sgml, 'html5lib')

And here's the file:

<SEC-HEADER>0001193125-15-118890.hdr.sgml : 20150403
<ACCEPTANCE-DATETIME>20150403143902
<ACCESSION-NUMBER>0001193125-15-118890
<TYPE>DEF 14A
<PUBLIC-DOCUMENT-COUNT>37
<PERIOD>20150515
<FILING-DATE>20150403
<DATE-OF-FILING-DATE-CHANGE>20150403
<EFFECTIVENESS-DATE>20150403
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>AETNA INC /PA/
<CIK>0001122304
<ASSIGNED-SIC>6324
<IRS-NUMBER>232229683
<STATE-OF-INCORPORATION>PA
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
...
</SEC-HEADER>

Here FILER and COMPANY-DATA require closing tags, and the others don't.

How can I tell BeautifulSoup's parser to close certain tags at the end of the line? Does it have something to do with how BS deals with br and li vs. a and div?

Anton Tarasenko
  • BeautifulSoup parses and extracts data from badly formatted HTML/XML, but when the broken markup is ambiguous it falls back on a set of rules to interpret the tags, which is not what you want here. Why not use regular expressions to parse the file instead of BeautifulSoup? – Christos Papoulas Apr 12 '17 at 09:49
  • @ChristosPapoulas For custom tags, BeautifulSoup had a `selfClosingTags` parameter in the constructor (`BeautifulSoup()`). It's not there in BeautifulSoup4; see, e.g., http://stackoverflow.com/questions/14961497/how-to-get-beautifulsoup-4-to-respect-a-self-closing-tag. The BS4 docs say "The tree builder is responsible for understanding self-closing tags", but how do I set them there? – Anton Tarasenko Apr 12 '17 at 10:14
  • http://stackoverflow.com/questions/12505419/parse-sgml-with-open-arbitrary-tags-in-python-3/12534420#12534420 might be of interest to you. – Bill Bell Apr 13 '17 at 03:41

1 Answer


I couldn't find a way to control the tree builder in BeautifulSoup. Instead, I closed the open tags with regular expressions (as suggested by @ChristosPapoulas) and ended up with an XML file.

Adding to the code I had in the question:

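import re

# Names of all opening tags, e.g. 'TYPE' from '<TYPE>'
all_tags = set(re.findall(r'<([^>/]+)>', sgml))

# Names of tags that have an explicit closing tag, e.g. 'FILER' from '</FILER>'
closed_tags = set(re.findall(r'</([^>]+)>', sgml))

# Open tags: opened somewhere in the file but never closed
open_tags = sorted(all_tags - closed_tags)

# Close each open tag at the end of its line, knowing that each of them takes just one line
sgml_xml = re.sub(
    r'(<({})>.*)'.format('|'.join(map(re.escape, open_tags))),
    r'\1</\2>',
    sgml
)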
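
To check that the result really is well-formed XML, here is a minimal, self-contained sketch that applies the same approach to an abbreviated sample and parses it with the standard library's `xml.etree.ElementTree`. The sample is modeled on the header in the question, not the full file, and it assumes the elided part closes FILER. The helper name `close_open_tags` is hypothetical. A value containing a bare `&` or `<` would still need escaping before this parses.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated sample modeled on the question's header (assumption: FILER
# is closed in the part the question elides).
sample = """<SEC-HEADER>0001193125-15-118890.hdr.sgml : 20150403
<TYPE>DEF 14A
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>AETNA INC /PA/
<CIK>0001122304
</COMPANY-DATA>
</FILER>
</SEC-HEADER>"""


def close_open_tags(sgml):
    """Hypothetical helper: close every never-closed tag at the end of its line."""
    all_tags = set(re.findall(r'<([^>/]+)>', sgml))
    closed_tags = set(re.findall(r'</([^>]+)>', sgml))
    open_tags = sorted(all_tags - closed_tags)
    if not open_tags:
        return sgml  # nothing to close
    pattern = r'(<({})>.*)'.format('|'.join(map(re.escape, open_tags)))
    return re.sub(pattern, r'\1</\2>', sgml)


xml_text = close_open_tags(sample)
root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed

# Nested lookups now work with ordinary element paths.
cik = root.findtext('FILER/COMPANY-DATA/CIK')
filing_type = root.findtext('TYPE')
print(cik, filing_type)
```

Once the text parses, it could equally be handed to BeautifulSoup with an XML parser instead of `ElementTree`; the standard library is used here only to keep the sketch dependency-free.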

Still curious how to manipulate tag properties in the tree builder.

Anton Tarasenko