I would like to use Python 2.7 to strip everything except the documents' text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:
EDGAR provides its Document Type Definitions starting on page 48 of this file:
The first part of my program downloads the .txt file from the EDGAR online database into a local file named "parseme.txt". What I would like to know is how to use the DTD to parse that file. I would normally reach for an off-the-shelf parser such as BeautifulSoup, but EDGAR's format appears to be unique, and I hope to avoid writing a large regex to get the job done.
import os

filename = 'parseme.txt'
# Read the downloaded filing into a list of lines
with open(filename) as f:
    lines = f.readlines()
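To make the goal concrete, here is a minimal sketch of the kind of result I'm after. It does not use the DTD at all; it only assumes that each document body in the filing sits between a `<TEXT>` line and a `</TEXT>` line, with each tag on its own line. The function name `extract_text_blocks` and the sample data are made up for illustration.

```python
def extract_text_blocks(lines):
    """Return the body of each <TEXT>...</TEXT> block as one string.

    Assumes the opening and closing tags each start their own line;
    anything else on a tag line is ignored.
    """
    blocks = []
    current = None  # None while we are outside a <TEXT> block
    for line in lines:
        tag = line.strip().upper()
        if tag.startswith('<TEXT>'):
            current = []
        elif tag.startswith('</TEXT>'):
            if current is not None:
                blocks.append(''.join(current))
            current = None
        elif current is not None:
            current.append(line)
    return blocks


# Hypothetical miniature filing, not real EDGAR data
sample = [
    '<DOCUMENT>\n',
    '<TYPE>10-K\n',
    '<TEXT>\n',
    'Annual report body.\n',
    '</TEXT>\n',
    '</DOCUMENT>\n',
]
print(extract_text_blocks(sample))
```

This still leaves any markup inside the body untouched, which is the part I would like the DTD (or a proper parser) to handle.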
My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: my question concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.