I am trying to parse SEC company filings from sec.gov. Starting from a 10-Q filing index (fb 10-Q index.htm), consider the complete submission text file. It has a structure like:

<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.

"some lines resembling yaml markup" 
These are indented lines with a 
"key": "value" structure.

</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.

</DOCUMENT>

"several DOCUMENT tags" ...


</SEC-DOCUMENT>

I tried to figure out the structure of the <SEC-HEADER> tag and found some information under Public Dissemination Service (PDS) Technical Specification (pdf) and concluded that the content of the header should be SGML.

Nevertheless, I am clueless about the formatting, since there are no angle brackets, and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the linked pdf I could not find anything about colons.

Question: Is the content of the <SEC-HEADER> tag valid SGML? If it is, how do I parse it?

I'd be glad of any help.

Michael S

1 Answer

The short answer is no. The content of the <SEC-HEADER> tag in the raw filing is not valid SGML.

However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
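
Since the header file sits in the same directory as the raw filing, its location can be derived from the location of the .txt file by swapping the suffix. A minimal sketch, assuming both files share the accession-number stem as described above; the function name and the example URL are made up for illustration:

```python
def header_location(txt_location: str) -> str:
    """Return the location of <accession_num>.hdr.sgml for a given
    <accession_num>.txt location (works for URLs and file paths alike)."""
    suffix = ".txt"
    if not txt_location.endswith(suffix):
        raise ValueError("expected a location ending in .txt")
    # Same directory, same accession number, different extension.
    return txt_location[: -len(suffix)] + ".hdr.sgml"


# Hypothetical location, not a real filing.
example = header_location("https://example.invalid/filings/0000000000-00-000000.txt")
```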

I use a regex of the form ^<(.+?)>(.+?)$ (with the re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict(). I believe the only tag in that file that has a closing tag is <FILER> (closed by </FILER>), since there can be multiple filers in each filing. You can first extract those using a regex of the form <FILER>(.+?)</FILER> (with the re.DOTALL option, because each block spans several lines) and then apply the same regex as above to get the inner tags for each filer.

Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
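
The approach described above could be sketched roughly as follows. This is only an illustration: the function names are mine, the sample text is invented rather than taken from a real filing, and the exact spelling of the entity tags inside the .hdr.sgml file (for example spaces versus hyphens) is an assumption you should check against an actual header file.

```python
import re

# One "<TAG>value" pair per line; lines holding only a tag do not match.
TAG_VALUE = re.compile(r"^<(.+?)>(.+?)$", re.MULTILINE)


def block_pattern(block_tag):
    # DOTALL lets "." cross line breaks, since a block spans several lines.
    tag = re.escape(block_tag)
    return re.compile(r"<{0}>(.+?)</{0}>".format(tag), re.DOTALL)


def parse_tags(text):
    # If a tag repeats, the last value wins because the result is a dict.
    return {tag: value.strip() for tag, value in TAG_VALUE.findall(text)}


def parse_header(text, block_tag="FILER"):
    pattern = block_pattern(block_tag)
    # Parse each enclosed block (e.g. each filer) on its own.
    blocks = [parse_tags(body) for body in pattern.findall(text)]
    # Remove the blocks so their inner tags do not leak into the top level.
    header = parse_tags(pattern.sub("", text))
    header[block_tag] = blocks
    return header


# Invented sample, shaped like the tag-per-line layout described above.
SAMPLE = """<ACCEPTANCE-DATETIME>20000101120000
<TYPE>10-Q
<FILER>
<CONFORMED-NAME>EXAMPLE CORP ONE
<CIK>0000000001
</FILER>
<FILER>
<CONFORMED-NAME>EXAMPLE CORP TWO
<CIK>0000000002
</FILER>
"""

header = parse_header(SAMPLE)
```

To handle one of the other relationship tags, pass its name as block_tag, using whatever spelling the header file actually contains.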

Sepp L.