Use lxml to parse text file with bad header in Python

Question

I would like to parse text files (stored locally) with lxml's etree. But all of my files (thousands) have headers, such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
<ACCEPTANCE-DATETIME>20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913

and the first < doesn't appear until line 51 in this case (the line number varies from file to file). The XML portion starts as follows:

</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

Can I handle this on-the-fly with lxml? Or should I use a stream editor to omit each file's header? Thanks!

Here is my current code and error.

from lxml import etree
f = etree.parse('temp.txt')

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Edit:

FWIW, here is a link to the file.

Ah, OK. Let me learn about those (sorry, just getting started, and it's not in the man file, then sometimes I'm not sure what function to look for next). Thanks! — Richard Herron, Sep 13 '12 at 19:08
Oh no problem, you can skip lines with an opened file using file.readline(), and then you can use etree.parse(file). the StringIO is because I wasn't sure if parse() accepts file objects, forget about it. — unddoch, Sep 13 '12 at 19:13
It looks like the headers are RFC822 format, which means there's no guarantee there won't be a '<' somewhere in the headers. I'd either use some RFC822 parsing code, or just readline until I get a blank line. — abarnert, Sep 13 '12 at 20:29
First of all, you definitely need to strip the PEM headers before attempting to parse the markup. Secondly, unfortunately that's SGML, not XML. Parsing SGML correctly is quite a bit more challenging than parsing well formed XML. So, could you narrow down what information you actually need to extract? Just the HTML inside the `` node, or also any of the metadata before it? — Lukas Graf, Sep 13 '12 at 20:52
@LukasGraf -- Thanks! I only need the content inside the first ``, `` tags. — Richard Herron, Sep 13 '12 at 21:16
@abarnert -- Yes, there's a standard (thanks for the nudge) -- http://www.sec.gov/info/edgar/pdsdissemspec910.pdf — Richard Herron, Sep 14 '12 at 00:14

score 6 · Accepted Answer · answered Sep 14 '12 at 19:51

Given that there's a standard for these files, it's possible to write a proper parser rather than guessing at things, or hoping beautifulsoup gets things right. That doesn't mean it's the best answer for you, but it's certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf what you've got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, "edgar.dtd".

The first thing I'd do is install SP and use its tools to make sure that the documents really are valid and parseable by that DTD, to make sure you don't waste a bunch of time on something that isn't going to pan out.

Python comes with a validating SGML parser, sgmllib. Unfortunately, it was never quite finished, and it's deprecated in 2.6-2.7 (and removed in 3.x). But that doesn't mean it won't work. So, try it and see if it works.

If not, I don't know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap up any C or C++ library (I'd start with SP) pretty easily, as long as you're comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap up the top-level functions, not build a complete set of bindings. But if you've never done anything like this before, it's probably not the best time to learn.

Alternatively, you can wrap up a command-line tool. SP comes with nsgmls. There's another good tool written in perl with the same name (I think part of http://savannah.nongnu.org/projects/perlsgml/ but I'm not positive.) And dozens of other tools.
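As a rough sketch of what wrapping a command-line tool could look like: nsgmls writes its parse result as ESIS, a line-oriented text format where each line starts with a command character. The helper names below are made up for illustration, the full command line (including how the DTD is supplied) is left to the caller, and only three ESIS commands are handled.

```python
import subprocess

def parse_esis(lines):
    """Turn ESIS output lines into (event, value) tuples.

    Only element start '(', element end ')' and data '-' are handled;
    every other command character is skipped in this sketch.
    """
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command == "(":
            events.append(("start", rest))
        elif command == ")":
            events.append(("end", rest))
        elif command == "-":
            # ESIS writes a record end inside data as the two characters '\n'
            # (escaped backslashes are not handled in this simplification).
            events.append(("data", rest.replace("\\n", "\n")))
    return events

def run_sgml_tool(argv):
    """Run an external parser (argv supplied by the caller) and parse its stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return parse_esis(result.stdout.splitlines())
```

A call might look like `run_sgml_tool(["nsgmls", "filing.sgml"])`, where `filing.sgml` is a hypothetical file name; parser error messages go to stderr and are ignored here.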

Or, of course, you could write the whole thing, or just the parsing layer, in perl (or C++) instead of Python.

Thanks! So much to learn here. I have had luck with `BeautifulStoneSoup` (recommended by @jterrace). I want the ability to quickly find a given section and search the text, so I think BSS will do fine (although I need to drop many, many "\n" and "&nbsp" from the list). — Richard Herron, Sep 15 '12 at 00:40
I'd try running SP's nsgmls on some of the documents, with the DTD, and see if the output looks right. You may still want to use BeautifulSoup just for simplicity, but it's worth knowing if you have other options. (If the documents don't validate, you have no other options; if they do, you do.) — abarnert, Sep 17 '12 at 19:03
Thanks! The full solution will take me some time (too many plates spinning), but thanks for the pointers! — Richard Herron, Sep 19 '12 at 02:11
Dear @RichardHerron i am currently on work. I am trying to do the same exact thing for the last 2 years! I have the files downloaded as `txt` but i don't know how to use the `BeautifulSoup` because of the txt's format. I am a finance enthusiast and i really need to crunch data. Please feel free to drop me a line and talk if you have time. You would be a life saver. — ExoticBirdsMerchant, Apr 22 '14 at 21:09

score 4 · Answer 2 · edited Aug 26 '20 at 05:41

You can easily get to the encapsulated text of the PEM (Privacy-Enhanced Message, specified in RFC 1421) by stripping the encapsulation boundaries and separating everything in between into header and encapsulated text at the first blank line.

The SGML parsing is much more difficult. Here's an attempt that seems to work with a document from EDGAR:

from lxml import html

PRE_EB = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
POST_EB = "-----END PRIVACY-ENHANCED MESSAGE-----"

def unpack_pem(pem_string):
    """Takes a PEM encapsulated message and returns a tuple
    consisting of the header and encapsulated text.  
    """

    if not pem_string.startswith(PRE_EB):
        raise ValueError("Invalid PEM encoding; must start with %s"
                         % PRE_EB)
    if not pem_string.strip().endswith(POST_EB):
        raise ValueError("Invalid PEM encoding; must end with %s"
                         % POST_EB)
    msg = pem_string.strip()[len(PRE_EB):-len(POST_EB)]
    header, encapsulated_text = msg.split('\n\n', 1)
    return (header, encapsulated_text)


filename = 'secdoc_htm.txt'
data = open(filename, 'r').read()

header, encapsulated_text = unpack_pem(data)

# Now parse the SGML
root = html.fromstring(encapsulated_text)
document = root.xpath('//document')[0]

metadata = {}
metadata['type'] = document.xpath('//type')[0].text.strip()
metadata['sequence'] = document.xpath('//sequence')[0].text.strip()
metadata['filename'] = document.xpath('//filename')[0].text.strip()

inner_html = document.xpath('//text')[0]

print(metadata)
print(inner_html)

Result:

{'filename': 'd371464d10q.htm', 'type': '10-Q', 'sequence': '1'}

<Element text at 80d250c>
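If the nesting that `lxml.html` produces gets in the way, a plain regular-expression pass is one possible shortcut. This is only a sketch and not real SGML parsing: it assumes upper-case tags with one value per line, as in the excerpt in the question, and it assumes the `<TEXT>` block is closed by a `</TEXT>` tag. The function names are made up for illustration.

```python
import re

def document_fields(sgml, names=("TYPE", "SEQUENCE", "FILENAME", "DESCRIPTION")):
    """Read one-line fields such as <TYPE>10-K; the value runs to end of line."""
    fields = {}
    for name in names:
        match = re.search(r"<%s>([^\n<]*)" % re.escape(name), sgml)
        if match:
            fields[name.lower()] = match.group(1).strip()
    return fields

def first_text_block(sgml):
    """Return whatever sits between the first <TEXT> and the next </TEXT>."""
    match = re.search(r"<TEXT>(.*?)</TEXT>", sgml, re.DOTALL)
    return match.group(1) if match else None

sample = """<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
</HTML>
</TEXT>
</DOCUMENT>
"""

print(document_fields(sample))
print(first_text_block(sample))
```

The string returned by `first_text_block` can then be handed to `lxml.html.fromstring` on its own, which keeps the metadata fields flat.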

edited Aug 26 '20 at 05:41

Jinhua Wang

answered Sep 13 '12 at 21:19

Lukas Graf

Thanks, Lukas. So to make sure I understand, because SGML is less structured than XML or HTML, the best that I can hope for is a more manual solution? – Richard Herron Sep 13 '12 at 21:24
FB. Nice. Thanks for the help! I have to run, but I will try this tonight. Thanks. – Richard Herron Sep 13 '12 at 21:34
Exactly. SGML allows for implicitly closed tags, in this example ``, ``, `` and ``. This creates ambiguity, and when parsing the document with `lxml.html` it seems to nest them instead of keeping them flat. – Lukas Graf Sep 13 '12 at 21:35
It's not really that SGML is "less structured", but that it's more flexible, and without knowing what SGML language you're using, you're more in the dark. XML and HTML, at least some versions of them, are themselves SGML languages. In informal terms, reading XML without a DTD is like reading something which could be either Standard High German or Yiddish; reading SGML without a DTD is like reading something which is either some dialect of German or some dialect of English… – abarnert Sep 13 '12 at 21:41
@abarnert True, XML and HTML are both subsets of SGML. And +1 for a great metaphor :) – Lukas Graf Sep 13 '12 at 21:45
Strictly speaking, HTML 3-4 are SGML languages, and therefore subsets of SGML, but earlier HTML is not quite, and HTML5 is… well, an abstract language with two concrete serializations, one of which happens to look a lot like an SGML language but is explicitly defined not to be. (The other is of course an XML language, and XML is an SGML language, so you could say HTML is still SGML… except that it's possible to write HTML5 documents that aren't "polyglot" and can't be serialized in XHTML.) I wouldn't mention any of this, but the OP seemed excited about having so much new stuff to learn. :) – abarnert Sep 13 '12 at 22:10
@abarnert -- I don't know if "excited" is the right word. I guess this isn't an afternoon project. :) Thanks to both of you for all the help! – Richard Herron Sep 14 '12 at 00:16

abarnert · Answer 3 · 2012-09-13T22:16:11.627

Although the problem definition implies you want to start parsing at the first '<', I don't think this is a good idea. Those look like PEM headers (if not, they're something else derived from RFC(2)822), and they could have '<' characters in them. For example, you might find Originator-Name: "Foo Bar" <foo@bar.edu> one day. It's possible that the particular files you're looking at never will, but unless you can know that for sure, it's better not to rely on it.
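To make the risk concrete, here is a small sketch with a made-up message (the `Originator-Name` value is hypothetical, not taken from a real filing). Cutting at the first '<' lands inside the header, while splitting at the first blank line finds the body.

```python
# Hypothetical message: the header contains a '<' in an address.
sample = (
    "-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n"
    "Proc-Type: 2001,MIC-CLEAR\n"
    'Originator-Name: "Foo Bar" <foo@bar.edu>\n'
    "\n"
    "<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913\n"
)

# Naive approach: cut at the first '<'. This lands inside the header.
naive_body = sample[sample.index("<"):]

# Safer approach: the header ends at the first blank line.
header, body = sample.split("\n\n", 1)

print(naive_body.splitlines()[0])  # the address, not the document
print(body.splitlines()[0])        # the <SEC-DOCUMENT> line
```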

If you want to actually parse this as an RFC822 message with an XML body, that's pretty easy:

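import rfc822  # Python 2 only; the module was removed in Python 3
from lxml import etree

with open('temp.txt') as f:
  rfc822.Message(f).rewindbody()  # position f at the start of the body
  x = etree.parse(f)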

But technically this isn't valid for PEM (because PEM's header-body format is effectively a fork of RFC822 rather than incorporating it by reference). And it may not be even practically valid for various other similar not-quite-RFC822 formats. And really, all you care about is how headers and bodies are separated, which is a very simple rule:

from lxml import etree

with open('temp.txt', 'rb') as f:
  # Skip header lines up to and including the first blank line.
  while f.readline().strip():
    pass
  x = etree.parse(f)

The other alternative is to rely on the (apparent) fact that the body is always a SEC-DOCUMENT node:

with open('temp.txt') as f:
  text = f.read()
body = '<SEC-DOCUMENT>' + text.split('<SEC-DOCUMENT>', 1)[1]
x = etree.fromstring(body)

One last note: Generally, once you see RFC822 headers, that raises the question of whether the format is actually full RFC2822 + optional MIME. The fact that there are no content headers anywhere implies that you're probably safe here, but you might want to grep a large collection of them (or, if there's a definition of the file format somewhere, skim it over).
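One way to run that check in Python rather than with grep is sketched below. It assumes the header block ends at the first blank line and that continuation lines start with whitespace, as in the excerpt in the question; the function names and the glob pattern in the usage note are made up for illustration.

```python
import glob

def header_names(path):
    """Return the header field names that appear before the first blank line."""
    names = []
    with open(path, "r", errors="replace") as f:
        for line in f:
            if not line.strip():
                break  # blank line: end of the header block
            # Skip continuation lines (leading whitespace) and lines without
            # a ':' such as the BEGIN boundary.
            if line[0] not in " \t" and ":" in line:
                names.append(line.split(":", 1)[0])
    return names

def files_with_content_headers(pattern):
    """List files whose header block contains any Content-* field."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        if any(n.lower().startswith("content-") for n in header_names(path)):
            hits.append(path)
    return hits
```

For example, `files_with_content_headers("filings/*.txt")` (with `filings/` standing in for wherever the files live) returns an empty list if none of the headers look like MIME.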

As Lukas Graf points out above, the body appears to be an SGML document rather than an XML one, in which case all of this will just get you past the first hurdle to the point where your XML parse can start failing for the right reasons instead of the wrong ones… — abarnert, Sep 13 '12 at 21:12
Yep. You can use lxml's HTML or even BeautifulSoup parser to get _something_ out of it, but it seems to incorrectly nest the implicitly closed tags. — Lukas Graf, Sep 13 '12 at 21:14
@LukasGraf -- You're right. Even if I properly pair `` tags, lxml's `etree` throws me errors. But if I use lxml's `html`, then I can get results. Clearly I have a lot more work here. Thanks! — Richard Herron, Sep 13 '12 at 21:30
I've edited my answer to deal with the fact that this is PEM, not strictly speaking RFC822. I didn't edit it to deal with the fact that it's SGML rather than XML, because I can't improve on Lukas's answer, and if this isn't useful as a supplemental answer showing how to parse off the header, I'd rather just delete it. — abarnert, Sep 13 '12 at 22:17

score 1 · Answer 4 · answered Sep 13 '12 at 20:55

You could use BeautifulSoup for this:

>>> from BeautifulSoup import BeautifulStoneSoup
>>> soup = BeautifulStoneSoup(xmldata)
>>> print soup.prettify()
-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==
<sec-document>
 0001193125-07-200376.txt : 20070913
 <sec-header>
  0001193125-07-200376.hdr.sgml : 20070913
  <acceptance-datetime>
   20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913
  </acceptance-datetime>
 </sec-header>
 <document>
  <type>
   10-K
   <sequence>
    1
    <filename>
     d10k.htm
     <description>
      FORM 10-K
      <text>
       <html>
        <head>
         <title>
          Form 10-K
         </title>
        </head>
        <body bgcolor="WHITE">
         <h5 align="left">
          <a href="#toc">
           Table of Contents
          </a>
         </h5>
        </body>
       </html>
      </text>
     </description>
    </filename>
   </sequence>
  </type>
 </document>
</sec-document>
