Parsing EDGAR filings

Question

I would like to use Python 2.7 to strip everything except the documents' text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:

Example

EDGAR provides its Document Type Definitions starting on page 48 of this file:

DTD

The first part of my program gets the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse the .txt file. I would use a canned parsing module like BeautifulSoup for the job, but EDGAR's format appears unique, and I hope to avoid a large regex to get the job done.

import os
filename = 'parseme.txt'
with open(filename) as f:
    lines = f.readlines()

My question is related to Parse SGML with Open Arbitrary Tags in Python 3 and Use lxml to parse text file with bad header in Python, but I believe it is distinct: my question concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.

I don't think the version of Python matters much here. Did you try any of the ideas that were provided in the answers to the linked questions? Where exactly are you stuck? — mzjn, Dec 26 '12 at 16:59

arayq2 · Answer 1 · 2012-12-31T21:16:38.003

Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.

There may be some setup to do first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the stuff in your reference at page 48, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.

UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any readily machine-processable versions, though, so you may have to copy-paste from the PDF.

However, if you do so, there will be some extraneous formatting you'll have to remove: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of SGML and need to be deleted.
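The cleanup described above could be sketched roughly as follows. This assumes each page marker sits alone on its own line and always has the form "C-" followed by digits; the pattern and the function name `strip_page_markers` are illustrative, so check them against the text you actually pasted.

```python
import re

# Assumption: a page-break marker occupies a whole line, e.g. "C-1" or "  C-12  ".
PAGE_MARKER = re.compile(r"^\s*C-\d+\s*$")


def strip_page_markers(text):
    """Return text with lines that consist only of a page marker removed."""
    kept = [line for line in text.splitlines() if not PAGE_MARKER.match(line)]
    return "\n".join(kept)
```

Lines that merely mention a marker inside other content are left alone, so a manual check of the result is still worthwhile.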

You can either add the SGML declaration and the EDGAR DTD to the catalog (in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end), or you can create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>) and run any program in the toolkit on the prolog and your SGML file. That is, put both names on the command line, with the prolog file first, so that the parser reads both files in the correct order.

To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.
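From Python, the prolog approach might look something like the sketch below. It assumes `osx` is installed and on the PATH, and that it writes the converted XML to standard output. The file name `edgar_prolog.sgml` and the helper names are hypothetical, and a real setup may need extra `osx` options (for catalogs, for example) that are not shown here.

```python
import subprocess
import xml.etree.ElementTree as ET


def osx_command(prolog_path, sgml_path):
    """Build the osx command line: prolog first, then the filing itself."""
    return ["osx", prolog_path, sgml_path]


def xml_to_text(xml_data):
    """Concatenate all character data in the XML, dropping the markup."""
    root = ET.fromstring(xml_data)
    return "".join(root.itertext())


def sgml_to_text(prolog_path, sgml_path):
    """Run osx on prolog + filing and return the filing's plain text.

    Raises subprocess.CalledProcessError if osx exits with a non-zero status.
    """
    xml_data = subprocess.check_output(osx_command(prolog_path, sgml_path))
    return xml_to_text(xml_data)
```

With the prolog saved next to the filing, `sgml_to_text('edgar_prolog.sgml', 'parseme.txt')` would return the text, provided the filing actually validates against the DTD.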

I posted [a similar answer](http://stackoverflow.com/a/12534420/407651) to one of the linked questions. But I haven't received any feedback. — mzjn, Dec 31 '12 at 20:40
These PEM-encapsulated messages don't look like EDGAR filings. Rather they seem to be taken from the correspondence archive. The relevant DTD must be elsewhere. — arayq2, Dec 31 '12 at 21:01

score 3 · Answer 2 · answered Jan 19 '14 at 02:54

The pysec project looks promising. It's a basic Django app that downloads the Edgar index and then allows you to download specific filings and extract financial parameters from the XBRL.

Cerin

tsspires · Answer 3 · 2013-06-27T01:59:56.853

0

The link below points to a library that parses EDGAR filings into a SQLite DB. It contains functionality to pull Form10k and Form8Qk filings from the EDGAR FTP site for the years you specify and load them into a normalized format in SQLite DB tables. Because the filings adhere poorly to the standard, writing your own parsing script would be a significant undertaking. That library, with code similar to the following, will load filings for the wanted quarters; from there you can simply query the tables for the data you are seeking.

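import edgar  # the library linked below

edgar.database.create()
# Load quarterly master index files into the local SQLite DB.
# Lists have no add() method, so append is used here; passing
# (year, quarter) tuples is an assumption about what load() expects.
quarters = []
# Q3 2009
quarters.append((2009, 3))
# Q3 2008
quarters.append((2008, 3))
edgar.database.load(quarters)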

http://rf-contrib.googlecode.com/svn/trunk/ha/src/main/python/edgar/

edited Jun 27 '13 at 01:59

answered Jun 26 '13 at 22:14

An answer that is mostly a link is discouraged on SO for many reasons. Could you paraphrase the important aspects of the link to help other users? – chrislondon Jun 26 '13 at 22:35
The link seems to require a password now – prewett Oct 24 '15 at 03:51
The link seems to return 404 not found now :-) – m3nda Nov 13 '16 at 21:40

score -1 · Answer 4 · answered Jul 03 '21 at 08:59

Check these two functions from edgarWebR (https://mwaldstein.github.io/edgarWebR/):

parse_submission()

parse_filing()

parse_submission works on the SGML document you get from EDGAR.

parsed_submission <- try(parse_submission(my_file_name))

Then get the text from the parsed submission:

tmp <- parsed_submission[parsed_submission$TYPE=='10-K',]

content_text <- tmp$TEXT

Finally, you can get the items by parsing the filing:

filing <- try(parse_text_filing(content_text))
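Since the question asks about Python, a rough analogue of the first two steps is sketched below. It is not edgarWebR's implementation. It assumes the full-submission file wraps each document in `<DOCUMENT>` ... `</DOCUMENT>`, gives the form type on a `<TYPE>` line, and puts the body between `<TEXT>` and `</TEXT>`; the function names are made up for illustration.

```python
import re

# Assumed layout: <DOCUMENT> ... <TYPE>10-K ... <TEXT> body </TEXT> ... </DOCUMENT>
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\r\n<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_submission(raw):
    """Split a raw submission into a list of {'TYPE': ..., 'TEXT': ...} dicts."""
    docs = []
    for match in DOCUMENT_RE.finditer(raw):
        block = match.group(1)
        doc_type = TYPE_RE.search(block)
        text = TEXT_RE.search(block)
        docs.append({
            "TYPE": doc_type.group(1).strip() if doc_type else None,
            "TEXT": text.group(1).strip() if text else "",
        })
    return docs


def text_for_type(raw, wanted="10-K"):
    """Return the TEXT bodies of all documents whose TYPE equals wanted."""
    return [d["TEXT"] for d in split_submission(raw) if d["TYPE"] == wanted]
```

For example, `text_for_type(open('parseme.txt').read())` would return the 10-K bodies, which may still contain HTML or table markup that needs further cleaning.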
