How would I approach a lot of structured-but-inconsistent data?

Question

I'm attempting to parse EDGAR documents - they're SEC filings. Specifically, I'm attempting to parse both SEC Schedule 13D and Schedule 13G filings.

There appear to have been many failed attempts at parsing these filings, and I assume that's because doing so is a behemoth task that an entire team would have to tackle.

I was tasked with parsing those filings. We need the information from the data tables found throughout. The problem is that the filings on record make it hard for me to distinguish between data points, table section headers, etc.

So far, I've only been able to scrape information from around 10% of the Schedule 13D files, and even what I've scraped needs considerable cleaning. In a nutshell, I'm matching a regular expression pattern to text. The pattern takes one known (English) section header and the one that comes next (I set each manually) and extracts what's in between: e.g., CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(.*?)SEC USE ONLY. Clearly, that's not going to get me very far, and it isn't. Using the same logic, here's what I get based on the following example string:

example text

NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS (ENTITIES ONLY)Robert DePaloCHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(see instructions)(a) (b) SEC USE ONLYSOURCE OF FUNDS (see instructions)CHECK BOX IF DISCLOSURE OF LEGAL PROCEEDINGS IS REQUIRED PURSUANT TO ITEMS 2(d) or 2(e) CITIZENSHIP OR PLACE OF ORGANIZATIONUnited StatesSOLE VOTING POWER45,119,857 (1)SHARED VOTING POWER-0-SOLE DISPOSITIVE POWER45,119,857 (1)10.SHARED DISPOSITIVE POWER-0-11.AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON45,119,857 (1)12.CHECK BOX IF THE AGGREGATE AMOUNT IN ROW (11) EXCLUDES CERTAIN SHARES(see instructions) 13.PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)33.4% (2)14.TYPE OF REPORTING PERSON (see instructions)(1) Consists of 44,194,298 shares of Common Stock held by the Reporting Person and 925,559 shares of Common Stock held by Arjent Limited UK. The Reporting Person is the Chairman of Arjent Limited UK and has voting and investment authority over shares held by it. Does not include any classes of preferred shares that the Reporting Person and an entity owned by the Reporting Person’s wife are entitled to receive, as discussed in Item 6 below.(2) Does not include the voting interest that the Reporting Person is entitled to receive under the SPHC Series B Preferred Shares, as discussed in Item 6 of this Schedule 13D.
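A minimal sketch of the header-pair extraction described above, applied to a fragment of the example text. The function name `extract_between_headers` and the `HEADERS` list are made up for illustration and are not a complete set of cover-page headers; the sketch also assumes the headers appear verbatim and in this order, which is exactly the assumption that breaks on most filings.

```python
import re

# Known section headers in the order they are expected to appear.
# This list is an illustrative assumption, not a complete set.
HEADERS = [
    "NAMES OF REPORTING PERSONS",
    "CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP",
    "SEC USE ONLY",
    "SOURCE OF FUNDS",
    "CITIZENSHIP OR PLACE OF ORGANIZATION",
    "SOLE VOTING POWER",
    "SHARED VOTING POWER",
    "SOLE DISPOSITIVE POWER",
]


def extract_between_headers(text, headers=HEADERS):
    """Map each header to the text between it and the next known header."""
    fields = {}
    for current, following in zip(headers, headers[1:]):
        # Non-greedy capture of everything between two adjacent headers.
        pattern = re.escape(current) + r"(.*?)" + re.escape(following)
        match = re.search(pattern, text, flags=re.DOTALL)
        if match:
            fields[current] = match.group(1).strip()
    return fields


# Fragment copied from the example text above.
sample = (
    "CITIZENSHIP OR PLACE OF ORGANIZATIONUnited States"
    "SOLE VOTING POWER45,119,857 (1)SHARED VOTING POWER-0-"
    "SOLE DISPOSITIVE POWER45,119,857 (1)"
)
fields = extract_between_headers(sample)
for header, value in fields.items():
    print(header, "->", value)
```

Any header pair that does not match is skipped silently, and the last header in the list never gets a value because nothing follows it. That shows why coverage drops as soon as a filing rewords, reorders, or omits a header.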

Are there any other approaches? This doesn't work for most of the 13D filings, and it won't work for 13G. I have a feeling I'm a little too naive in my approach, and I need a common approach to a problem like this. I'm looking to scrape at least 80% of at least 80% of the filings.

It's hard to say without seeing the variations, but the first documents you linked to look much easier to parse because of their spacing. Can you not preserve that when you go to parse? — dmgig, Apr 21 '15 at 20:53
@dgig So preserve the spacing? Then, parse based on spacing? — Mr_Spock, Apr 21 '15 at 21:02
So, you are trying to use regex (the worst device for parsing other than no regex) to parse documents that are written in stilted natural-language English with a huge legal bent. Yes, you've picked a pretty hard task. There are well-funded startups in Silicon Valley that are attempting to do this (so far, they haven't eclipsed Facebook in the stock market). Unless you think you have especially new ideas about this, I think you are fighting a losing battle from the start. I suggest you go read everything you can find on parsing English or legal documents, and then reconsider your approach. — Ira Baxter, Apr 22 '15 at 10:31
@IraBaxter Look, the last thing I want to do is use regex -- it's extremely complicated to create anything sophisticated and extensible with it. But I looked to regex because of its flexibility (gives me a considerable amount of freedom). Please provide an alternative. I'm using Python. It's easy to manipulate and search strings via list comprehensions. I could do that instead. Otherwise, let me know what you'd use. — Mr_Spock, Apr 22 '15 at 11:31
You probably need to use a NL parser (there is a sort of famous one from Stanford). Then you have to deal with the stilted English. You need a knowledge base containing facts about structures of legal documents, expectations of content, context for interpretation, ... this amounts to Big AI. That's what I recommend, and I told you how I thought you should proceed. You seem insistent on plunging ahead. Good luck with that. "Those who don't know the past [what other people have done], are doomed to repeat it." (Actually, usually doing something dumber). — Ira Baxter, Apr 22 '15 at 13:17
@IraBaxter It's not me who's insistent. My boss wants output. I want to do more research and design, but it's "Try the most obvious attack schemes, even if they aren't great nor clear, and see what hits immediately. If it doesn't work, we move on to something else, although we're going to have to go back to this problem later." Trust me when I say I'd like to take a more systematic approach to the problem. You have over 45 years in the game -- maybe this is more a programmers.stackexchange problem, considering my 'plunging' isn't particularly an incompetency issue, but more a cultural one. ha — Mr_Spock, Apr 22 '15 at 15:02
OK, I guess I'm sympathetic with your situation but I can't help you with your boss. The world is far too full of Captain Picards, saying "Make It So" without having any clue whether So can be Made in any reasonable way. That only works on TV. You can make me the fall guy in your explanation if you want; I'm used to wearing asbestos suits. You can tell him that some old fogey said you need a natural language parser, and so you want to pursue that. That will let you burn cycles getting it all hooked up and trying to extract data; it won't be enough, but it will look like you are trying. — Ira Baxter, Apr 22 '15 at 15:33
Try this link: http://nlp.stanford.edu/software/lex-parser.shtml — Ira Baxter, Apr 22 '15 at 15:35
@IraBaxter Thank you so much. Seriously. I feel a lot better now that I got advice from a vet. I'll take a gander at that link. I think you're absolutely right -- and love the Star Trek ref. hahaha — Mr_Spock, Apr 22 '15 at 15:36
@Mr_Spock: the docs you cite in the OP contain HTML, and as such, contain lots of formatting info that also serves to delimit fields. Is that true of all 13D and 13G forms? If so, I won't repeat the tired trope "thou shalt not use RegEx to parse HTML", but I'd suggest looking into using XML or CSS queries to pull out the good stuff. For example, check out BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) or other such packages. — fearless_fool, Nov 16 '15 at 06:52
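A rough sketch of the markup-based idea in the last comment: let the HTML table cells delimit the fields instead of matching header text. To stay self-contained it uses the standard library's `html.parser` rather than BeautifulSoup. The class name `CellCollector` and the sample markup are invented for illustration; real filings may nest tables, split one field across several cells, or use no tables at all.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the markup layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample markup; actual EDGAR cover pages vary widely.
sample_html = """
<table>
  <tr><td>7.</td><td>SOLE VOTING POWER</td><td>45,119,857 (1)</td></tr>
  <tr><td>8.</td><td>SHARED VOTING POWER</td><td>-0-</td></tr>
</table>
"""
parser = CellCollector()
parser.feed(sample_html)
parser.close()
rows = parser.rows
for row in rows:
    print(row)
```

With the cell boundaries preserved, the label and its value arrive as separate strings, so matching a label against a list of known headers becomes a per-cell comparison rather than a search through run-together text.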
