I'm assuming that I have a database of the desired SEC filings (Form 10s initially). The files are mostly HTML markup; they look like this:
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d445434d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<div style="line-height:120%;font-size:8pt;"><font style="font-family:inherit;font-size:8pt;"> </font></div><div style="line-height:120%;text-indent:32px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-style:italic;">All references in this Form 10-K to the “Company”, “Contango”, “we”, “us” or “our” are to Contango Oil & Gas Company and wholly-owned Subsidiaries. Unless otherwise noted, all information in this Form 10-K relating to natural gas and oil reserves and the estimated future net cash flows attributable to those reserves are based on estimates prepared by independent engineers and are net to our interest.</font></div>
Eventually I want to split each filing into sections and store them in a database.
For example this:
Overview</font></div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-indent:48px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">Contango is a Houston, Texas based, independent natural gas and oil company.  The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.  Contango Operators, Inc. (“COI”), our wholly-owned subsidiary, acts as operator of our offshore properties. Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves.
Into:
Overview Contango is a Houston, Texas based, independent natural gas and oil company.  The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.  Contango Operators, Inc. (“COI”), our wholly-owned subsidiary, acts as operator of our offshore properties. Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves
...and be able to pull each section into custom views (for example, a condensed version containing just Item 1. Business and Segment Information), with the boilerplate stripped out. My model will store the type, filename, and certain other metadata from this document.
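To make the metadata part concrete, here is a minimal sketch of what I have in mind. It assumes the header fields always appear as unclosed tags, one per line, before `<TEXT>`, as in the sample above. The function name `parseDocumentHeader` and the lowercase keys are my own invention, not part of any SEC tooling.

```javascript
// Hypothetical helper: collects the header fields that precede <TEXT>.
// Assumption: each field is an unclosed tag followed by its value on one line.
function parseDocumentHeader(raw) {
  const textStart = raw.indexOf('<TEXT>');
  const header = textStart === -1 ? raw : raw.slice(0, textStart);
  const meta = {};
  for (const line of header.split(/\r?\n/)) {
    // Matches "<TYPE>10-K" but not a bare "<DOCUMENT>" (no value after the tag).
    const match = /^<([A-Z]+)>(.+)$/.exec(line.trim());
    if (match) meta[match[1].toLowerCase()] = match[2].trim();
  }
  return meta;
}

const sampleHeader = [
  '<DOCUMENT>',
  '<TYPE>10-K',
  '<SEQUENCE>1',
  '<FILENAME>d445434d10k.htm',
  '<DESCRIPTION>FORM 10-K',
  '<TEXT>',
  '<HTML><HEAD>',
].join('\n');

console.log(parseDocumentHeader(sampleHeader));
```

Everything after `<TEXT>` is ignored here, so the HTML body would be handled separately.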
How would you go about parsing these files so the documents are stored the way I want? Ideally, each paragraph would be stored under a separate section depending on its subject.
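For reference, this is roughly the naive approach I can picture, and I'm not sure it is the right one. It strips tags with regular expressions (a real HTML parser would presumably be more robust) and starts a new section whenever a paragraph begins with something like "Item 1." or "Item 1A.". The names `htmlToParagraphs`, `groupByItem`, and the heading pattern are all my own assumptions, and the input is a made-up fragment in the style of the filing above.

```javascript
// Only a handful of entities are decoded here (an assumption; real filings use more).
const ENTITIES = { '&amp;': '&', '&nbsp;': ' ', '&lt;': '<', '&gt;': '>', '&quot;': '"' };

function htmlToParagraphs(html) {
  return html
    // Treat the end of block-level elements and <br> as paragraph breaks.
    .replace(/<\/(div|p|h[1-6]|tr|li)>|<br\s*\/?>/gi, '\n')
    // Drop every remaining tag.
    .replace(/<[^>]+>/g, '')
    .replace(/&(amp|nbsp|lt|gt|quot);/g, (entity) => ENTITIES[entity])
    .split('\n')
    .map((line) => line.replace(/\s+/g, ' ').trim())
    .filter((line) => line.length > 0);
}

// Hypothetical heading pattern: "Item 1.", "ITEM 1A.", "Item 7A:" at the start of a paragraph.
const ITEM_HEADING = /^item\s+(\d+[a-z]?)\s*[.:]/i;

function groupByItem(paragraphs) {
  const sections = [];
  let current = { item: null, heading: null, paragraphs: [] };
  const flush = () => {
    if (current.item !== null || current.paragraphs.length > 0) sections.push(current);
  };
  for (const paragraph of paragraphs) {
    const match = ITEM_HEADING.exec(paragraph);
    if (match) {
      flush();
      current = { item: match[1].toUpperCase(), heading: paragraph, paragraphs: [] };
    } else {
      current.paragraphs.push(paragraph);
    }
  }
  flush();
  return sections;
}

// Made-up fragment for illustration only.
const fragment =
  '<div><font>Item 1. Business</font></div>' +
  '<div><font>Overview</font></div><div><font><br></font></div>' +
  '<div style="text-indent:48px;"><font>Contango is a Houston, Texas based, ' +
  'independent natural gas and oil company.</font></div>' +
  '<div><font>Item 1A. Risk Factors</font></div>' +
  '<div><font>(placeholder paragraph)</font></div>';

console.log(JSON.stringify(groupByItem(htmlToParagraphs(fragment)), null, 2));
```

One obvious weakness: the table of contents also lists the item headings, so this would produce duplicate, nearly empty sections unless those are filtered out. It also only groups by item number, not by the subject of each paragraph.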
Most of these filings are not structured identically, but they have many things in common. Also, this question is not about XBRL or any quantitative data or tables, purely text. I'm using Node.js for this.
Any help is appreciated.