Extracting data from txt files

Question

OK, I'm using this Git repository from Git Bash. After running it, I have the txt files from the Securities and Exchange Commission's EDGAR database in this format on my hard drive. I am using Windows 7. The txt files have HTML tags inside.

I was wondering, since these text files have followed a strict SEC format since the early nineties, whether there is a way to extract a certain item, say:

<us-gaap:IncomeTaxExpenseBenefit contextRef="eol_PE9523----1310-K0013_STD_365_20131231_0"
    decimals="-3" id="id_3914012_7F3BEF88-8CD1-49E7-8A78-91A091178D1B_1_13"
    unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>

Can this be done accurately with a script or a Git repository, given that the format is strict? How, for instance, can someone extract a whole table from the txt file? Libraries, repositories, scripts: anything I can pick up with a little work and modification would be a fine start.

Can any of these repositories do such a job? I read the instructions (where there are any), but I don't understand much of them.

I think you can find a similar question here: http://stackoverflow.com/questions/13504278/parsing-edgar-filings – Talanor, Apr 21 '14 at 17:29
I've seen this question. The solution provided there is to use some sort of library to extract the data, which is different from my strategy: I have managed to download the txt files to my hard drive, and now I just need to extract some tables. I believe it is possible since the format is rigid – ExoticBirdsMerchant, Apr 21 '14 at 17:31
@ExoticBirdsMerchant it is certainly possible, but as it stands the question is much too broad. There are plenty of HTML parsers out there for various languages; pick one and get stuck into the documentation. – jonrsharpe, Apr 21 '14 at 18:04
`` doesn't look like valid HTML to me. More likely it's XML or some variant/knockoff thereof... – twalberg, Apr 21 '14 at 19:46

score 1 · Accepted Answer · answered Apr 23 '14 at 09:05

It's not HTML. It looks like XML. Try an XML parser for Python, for example ElementTree, and parse out the relevant information. A tutorial is included on their page.
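
As a rough illustration of the ElementTree suggestion, here is a minimal sketch that pulls the element from the question out of an XML document. The wrapping `<xbrl>` root, the namespace URI, and the helper name `find_facts` are placeholders made up for this example; in a real filing the `us-gaap` prefix is declared on the document's own root element. Matching on the local tag name avoids depending on the exact namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down document: the root element and namespace URI are
# placeholders, only the us-gaap element itself comes from the question.
SAMPLE = """<xbrl xmlns:us-gaap="http://example.com/us-gaap-placeholder">
  <us-gaap:IncomeTaxExpenseBenefit contextRef="eol_PE9523----1310-K0013_STD_365_20131231_0"
      decimals="-3" id="id_3914012_7F3BEF88-8CD1-49E7-8A78-91A091178D1B_1_13"
      unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>
</xbrl>"""


def find_facts(xml_text, local_name):
    """Return value and attributes of every element whose local name matches."""
    root = ET.fromstring(xml_text)
    facts = []
    for elem in root.iter():
        # ElementTree expands prefixed tags to "{namespace-uri}LocalName",
        # so strip everything up to the closing brace before comparing.
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == local_name:
            facts.append({"value": elem.text, **elem.attrib})
    return facts


facts = find_facts(SAMPLE, "IncomeTaxExpenseBenefit")
for fact in facts:
    print(fact["value"], fact["unitRef"], fact["contextRef"])
```

Each result is a plain dict, so the values can be written to CSV or collected into a table once the right elements are selected.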

Dropout


Could they be `SGML`? This discouraged me a little: http://stackoverflow.com/questions/12412994/use-lxml-to-parse-text-file-with-bad-header-in-python – ExoticBirdsMerchant Apr 23 '14 at 09:11
Well, yes: if it's XML then it's also SGML. Check out this article: http://webdesign.about.com/od/sgml/a/how-are-sgml-html-and-xml-related.htm I don't usually work with SGML/XML data structures, so I'm not really qualified to answer most of the advanced questions about them, but I'm sure it's not HTML. – Dropout Apr 23 '14 at 09:18
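
On the "bad header" worry: one possible workaround is to cut the well-formed XML out of the surrounding non-XML text before handing it to a parser. The sketch below assumes the XML sits between uppercase `<XBRL>` and `</XBRL>` wrapper lines; that wrapper, the sample text, and the helper name `extract_xml_blocks` are assumptions for illustration, so check them against your own downloaded files.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each embedded XML document is wrapped in uppercase <XBRL> tags.
# The match is deliberately case-sensitive so a lowercase </xbrl> root close
# tag inside the XML does not end the block early.
XBRL_BLOCK = re.compile(r"<XBRL>\s*(.*?)\s*</XBRL>", re.DOTALL)


def extract_xml_blocks(raw_text):
    """Parse every wrapped block that is well-formed XML and return its root."""
    roots = []
    for match in XBRL_BLOCK.finditer(raw_text):
        try:
            roots.append(ET.fromstring(match.group(1)))
        except ET.ParseError:
            # Skip blocks that are not well-formed XML instead of aborting.
            continue
    return roots


# Hypothetical stand-in for a downloaded txt file with a non-XML header.
SAMPLE_FILING = """placeholder header lines that are not XML
<TEXT>
<XBRL>
<xbrl xmlns:us-gaap="http://example.com/us-gaap-placeholder">
  <us-gaap:IncomeTaxExpenseBenefit unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>
</xbrl>
</XBRL>
</TEXT>
"""

roots = extract_xml_blocks(SAMPLE_FILING)
print(len(roots), "XML block(s) found")
```

For a real file, read it with `open(path, encoding="utf-8", errors="replace").read()` and pass the text to `extract_xml_blocks`; each returned root can then be searched with ElementTree as usual.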