I'm attempting to parse EDGAR documents - they're SEC filings. Specifically, I'm attempting to parse both SEC Schedule 13D and Schedule 13G filings.
There appear to be many failed attempts at parsing these filings, and I assume that's because doing so is a behemoth task that an entire team would have to tackle.
I was tasked with parsing those filings. We need the information from the data tables found throughout. The problem is that the filings on record make it hard for me to distinguish between data points, table section headers, etc.
So far, I've only been able to scrape information from around 10% of the Schedule 13D files, and even what I've scraped needs considerable cleaning. In a nutshell, I'm matching a regular expression pattern against the text. The pattern takes one known (English) section header and the one that comes next (I set each manually) and extracts what's in between, e.g., CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(.*?)SEC USE ONLY
Clearly, that's not going to get me very far, and it hasn't. Using the same logic, here's what I get from the following example string:
example text
NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS (ENTITIES ONLY)Robert DePaloCHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(see instructions)(a) (b) SEC USE ONLYSOURCE OF FUNDS (see instructions)CHECK BOX IF DISCLOSURE OF LEGAL PROCEEDINGS IS REQUIRED PURSUANT TO ITEMS 2(d) or 2(e) CITIZENSHIP OR PLACE OF ORGANIZATIONUnited StatesSOLE VOTING POWER45,119,857 (1)SHARED VOTING POWER-0-SOLE DISPOSITIVE POWER45,119,857 (1)10.SHARED DISPOSITIVE POWER-0-11.AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON45,119,857 (1)12.CHECK BOX IF THE AGGREGATE AMOUNT IN ROW (11) EXCLUDES CERTAIN SHARES(see instructions) 13.PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)33.4% (2)14.TYPE OF REPORTING PERSON (see instructions)(1) Consists of 44,194,298 shares of Common Stock held by the Reporting Person and 925,559 shares of Common Stock held by Arjent Limited UK. The Reporting Person is the Chairman of Arjent Limited UK and has voting and investment authority over shares held by it. Does not include any classes of preferred shares that the Reporting Person and an entity owned by the Reporting Person’s wife are entitled to receive, as discussed in Item 6 below.(2) Does not include the voting interest that the Reporting Person is entitled to receive under the SPHC Series B Preferred Shares, as discussed in Item 6 of this Schedule 13D.
example output
key: CHECK THE | v: (a)    (b)    
key: CITIZENSHI | v: United States
key: CHECK BOX | v:      
key: SHARED VOT | v: -0-
key: PERCENT OF | v: PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW \(11\)
key: TYPE OF RE | v: TYPE OF REPORTING PERSON \(see instructions\)
key: CHECK BOX | v:     13.
key: SOLE DISPO | v: 45,119,857
key: SEC USE ON | v: SEC USE ONLY
key: SHARED DIS | v: -0
key: SOLE VOTIN | v: 45,119,857
key: NAMES OF R | v: Robert DePalo
key: AGGREGATE | v: 45,119,857 12.
key: SOURCE OF | v: SOURCE OF FUNDS \(see instructions\)
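For reference, the header-pair matching described above can be sketched roughly as follows. This is a minimal illustration, assuming Python and the standard re module; the HEADERS list and the extract_between function are hypothetical names, not the exact script that produced the output above.

```python
import re

# Hypothetical, manually ordered list of known section headers (assumption:
# the real list is longer and differs between 13D and 13G).
HEADERS = [
    "NAMES OF REPORTING PERSONS",
    "CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP",
    "SEC USE ONLY",
    "SOURCE OF FUNDS",
    "CITIZENSHIP OR PLACE OF ORGANIZATION",
    "SOLE VOTING POWER",
    "SHARED VOTING POWER",
    "SOLE DISPOSITIVE POWER",
]

def extract_between(text, headers):
    """Map each header to the text found between it and the next header."""
    fields = {}
    for start, end in zip(headers, headers[1:]):
        # re.escape keeps parentheses and dots in headers literal.
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, text, flags=re.DOTALL)
        if match:
            fields[start] = match.group(1).strip()
    return fields

# A short excerpt of the example text above.
sample = (
    "CITIZENSHIP OR PLACE OF ORGANIZATIONUnited States"
    "SOLE VOTING POWER45,119,857 (1)SHARED VOTING POWER-0-"
    "SOLE DISPOSITIVE POWER45,119,857 (1)"
)

for key, value in extract_between(sample, HEADERS).items():
    print(f"key: {key[:10]} | v: {value}")
```

A field is only captured when both its own header and the next header in the list appear verbatim, which is why any variation in wording or ordering between filings silently drops values.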
Are there any other approaches? This doesn't work for most of the 13D filings, and it won't work for 13G. I suspect my approach is too naive, and I'd like to know the standard approach to a problem like this. I'm looking to scrape at least 80% of the data from at least 80% of the filings.