
I'm assuming that I have a database of the desired SEC filings (Form 10s initially). The files consist mostly of HTML markup; they look like this:

<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d445434d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<div style="line-height:120%;font-size:8pt;"><font style="font-family:inherit;font-size:8pt;">&#160;</font></div><div style="line-height:120%;text-indent:32px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-style:italic;">All references in this Form 10-K to the &#8220;Company&#8221;, &#8220;Contango&#8221;, &#8220;we&#8221;, &#8220;us&#8221; or &#8220;our&#8221; are to Contango Oil&#160;&amp; Gas Company and wholly-owned Subsidiaries. Unless otherwise noted, all information in this Form 10-K relating to natural gas and oil reserves and the estimated future net cash flows attributable to those reserves are based on estimates prepared by independent engineers and are net to our interest.</font></div>

Eventually I want to split each filing into sections and store them in a database.

For example, I want to turn this:

Overview</font></div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-indent:48px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves.

Into:

Overview Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves

...and be able to pull each section into custom views (for example, a condensed version containing just Item 1. Business and Segment Information) while getting rid of the boilerplate. My model will store the type, filename, and certain other metadata from this document.
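To make the metadata part concrete, here is a minimal sketch of reading the header fields that come before `<TEXT>` in the sample above. It assumes each field sits on its own line as `<NAME>value` with no closing tag, as in the excerpt; `parseDocumentHeader` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: collect the SGML header fields that precede <TEXT>.
// Assumption: every field is on its own line in the form <NAME>value.
function parseDocumentHeader(raw) {
  const textStart = raw.indexOf('<TEXT>');
  const header = textStart === -1 ? raw : raw.slice(0, textStart);
  const fields = {};
  // Lines such as <DOCUMENT> carry no value, so the ".+" skips them.
  const fieldPattern = /^<([A-Z][A-Z0-9-]*)>(.+)$/gm;
  let match;
  while ((match = fieldPattern.exec(header)) !== null) {
    fields[match[1].toLowerCase()] = match[2].trim();
  }
  return fields;
}

// Header lines copied from the sample filing above.
const sample = [
  '<DOCUMENT>',
  '<TYPE>10-K',
  '<SEQUENCE>1',
  '<FILENAME>d445434d10k.htm',
  '<DESCRIPTION>FORM 10-K',
  '<TEXT>',
  '<HTML><HEAD>',
].join('\n');

const header = parseDocumentHeader(sample);
console.log(header.type, header.filename);
```

The resulting object could be stored alongside the section text, so that each section row carries the filing type and filename it came from.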

How would you go about parsing these files to store the documents the way I want? Ideally, each paragraph would be stored in a separate section according to its subject.
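A minimal sketch of the flattening step shown in the example above, assuming that closing `</div>` and `</p>` tags and `<br>` mark paragraph boundaries and that every other tag can simply be dropped. The function names and the small entity table are illustrative assumptions; real filings use many more entities, and a proper HTML parser would be more robust than regular expressions on malformed markup.

```javascript
// Small illustrative entity table (assumption: extend as needed for real filings).
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"], ['nbsp', '\u00a0'],
]);

// Decode numeric and a few named entities in a single pass.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] !== '#') return NAMED_ENTITIES.get(body.toLowerCase()) ?? whole;
    const isHex = body[1] === 'x' || body[1] === 'X';
    const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
  });
}

// Hypothetical helper: turn an HTML fragment into an array of plain-text paragraphs.
function htmlToParagraphs(html) {
  return html
    .replace(/<\/(?:div|p)>|<br\s*\/?>/gi, '\n') // block boundaries become line breaks
    .replace(/<[^>]*>/g, '')                     // drop every remaining tag
    .split('\n')
    // Decode after stripping tags so a decoded "<" is never mistaken for markup.
    .map((line) => decodeEntities(line).replace(/\s+/g, ' ').trim())
    .filter((line) => line.length > 0);
}

// Abridged from the "Overview" excerpt above.
const sample =
  'Overview</font></div><div style="line-height:120%;font-size:10pt;">' +
  '<font style="font-family:inherit;font-size:10pt;"><br></font></div>' +
  '<div style="line-height:120%;text-indent:48px;font-size:10pt;">' +
  '<font style="font-family:inherit;font-size:10pt;">Contango is a Houston, Texas based, ' +
  'independent natural gas and oil company.&#160; Contango Operators, Inc. ' +
  '(&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.';

const paragraphs = htmlToParagraphs(sample);
console.log(paragraphs.join(' '));
```

Joining the paragraphs with a space gives the single-line form from the example; keeping them as an array makes it possible to store each paragraph as its own row. Note that this sketch also collapses non-breaking spaces into ordinary spaces.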

Most of these filings are not strictly identical in structure, but they have many things in common. Also, this question is not about XBRL or any quantitative data or tables; it is purely about text. I'm using Node.js for this.

Any help is appreciated.

JohnAllen
  • Where have you gotten stuck with your code? If it's not well-formed HTML (XHTML), parsing will be "fun". – WiredPrairie Oct 29 '13 at 00:51
  • *chuckle* I've spent hundreds of man-hours parsing, extracting and interpreting data from EDGAR and can assure you, it is not a trivial task. The 10-K, 10-K/A, 10-Q, and 10-Q/A filings are actually SGML wrappers around automatically generated HTML. Regular expressions are (or should be) your best friend in a job like this. – Rob Raisch Oct 29 '13 at 10:25
  • For parsing/recognizing/extracting things like company names, dates, locations, etc., you'll want to read up on "Named Entity Recognition" and if you're planning on going as far as recognizing/categorizing relationships between businesses, you'll want to research "Parts-Of-Speech Tagging" as well as knowledge representation frameworks like RDF. – Rob Raisch Oct 29 '13 at 10:31
  • In addition to (or in combination with) RegEx, as suggested by @RobRaisch , you can also use [probabilistic logical models](http://en.wikipedia.org/wiki/Statistical_relational_learning) like [Markov Logic Network](http://en.wikipedia.org/wiki/Markov_logic_network) and Bayesian Logic. – GuSuku Dec 11 '14 at 21:08
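
As a rough illustration of the regular-expression approach suggested in the comments, the sketch below splits already-flattened filing text into sections at "Item N." headings. It assumes each heading starts its own line and looks like `Item 1.` or `Item 1A.`; `splitIntoItems` is a hypothetical name, the second section body is a placeholder, and real filings will need extra handling (for example, the table of contents repeats the same headings).

```javascript
// Hypothetical helper: split plain text into sections at "Item N." headings.
// Assumption: each heading starts a line, e.g. "Item 1. Business".
function splitIntoItems(text) {
  const headingPattern = /^[ \t]*item[ \t]+(\d{1,2}[a-z]?)[ \t]*[.:][ \t]*(.*)$/gim;
  const headings = [];
  let match;
  while ((match = headingPattern.exec(text)) !== null) {
    headings.push({
      item: match[1].toUpperCase(),
      title: match[2].trim(),
      start: match.index,                  // where the heading line begins
      bodyStart: headingPattern.lastIndex, // where the heading line ends
    });
  }
  // Each body runs from the end of its heading to the start of the next one.
  return headings.map((heading, i) => {
    const end = i + 1 < headings.length ? headings[i + 1].start : text.length;
    return {
      item: heading.item,
      title: heading.title,
      body: text.slice(heading.bodyStart, end).trim(),
    };
  });
}

const text = [
  'Item 1. Business',
  'Overview',
  'Contango is a Houston, Texas based, independent natural gas and oil company.',
  'Item 1A. Risk Factors',
  '(risk factor text)', // placeholder body, not taken from a real filing
].join('\n');

const sections = splitIntoItems(text);
// A "custom view" is then just a filter over the stored sections.
const businessOnly = sections.filter((section) => section.item === '1');
console.log(businessOnly);
```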
