
OK, I'm using this git repository from Git Bash. After I run it, I have the txt files from the Securities and Exchange Commission database, EDGAR, in this format on my hard drive. I am using Windows 7. The txt files have HTML tags inside.

Since the SEC has kept these text files in a strict format since the early nineties, I was wondering whether there is a way to extract a certain item, let's say

<us-gaap:IncomeTaxExpenseBenefit contextRef="eol_PE9523----1310-K0013_STD_365_20131231_0"
decimals="-3" id="id_3914012_7F3BEF88-8CD1-49E7-8A78-91A091178D1B_1_13"
unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>

Can this be done accurately, whether with a script or a git repository, given that the format is strict? How, for instance, can someone extract a whole table from the txt file? Libraries, repositories, scripts: anything I can pick up with a little work and modification would be a fine start.

Can any of these repositories do such a job? I read the instructions (where there are any), but there is a lot I don't understand.

Luigi
ExoticBirdsMerchant
  • I think you can find a similar question here: http://stackoverflow.com/questions/13504278/parsing-edgar-filings – Talanor Apr 21 '14 at 17:29
  • I've seen this question. The solution provided there is to use some sort of library to extract the data. My strategy is different: I have managed to download the txt files to my hard drive, and now I just need to extract some tables. I believe it is possible since the format is rigid – ExoticBirdsMerchant Apr 21 '14 at 17:31
  • @ExoticBirdsMerchant it is certainly possible, but as it stands the question is much too broad. There are plenty of HTML parsers out there for various languages; pick one and get stuck in to the documentation. – jonrsharpe Apr 21 '14 at 18:04
  • That tag doesn't look like valid HTML to me. More likely it's XML or some variant/knockoff thereof... – twalberg Apr 21 '14 at 19:46
  • What can I parse it with? Can it be parsed with BeautifulSoup? – ExoticBirdsMerchant Apr 21 '14 at 20:27

1 Answer


It's not HTML. It looks like XML, so try using an XML parser for Python, for example ElementTree, and parse out the relevant information. A tutorial is included on its documentation page.
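As a rough illustration of the ElementTree suggestion, here is a minimal sketch that pulls the `IncomeTaxExpenseBenefit` fact from the question out of an XML document. It assumes you have already isolated a well-formed XML instance document from the downloaded txt file; the `SAMPLE` wrapper, the namespace URI and the helper name `find_facts` are hypothetical and only there to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document built around the element from the
# question. The namespace URI is an assumption: check the xmlns:us-gaap
# declaration in your own filing.
SAMPLE = """<xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <us-gaap:IncomeTaxExpenseBenefit
      contextRef="eol_PE9523----1310-K0013_STD_365_20131231_0"
      decimals="-3"
      id="id_3914012_7F3BEF88-8CD1-49E7-8A78-91A091178D1B_1_13"
      unitRef="iso4217_USD">40315000</us-gaap:IncomeTaxExpenseBenefit>
</xbrl>"""


def find_facts(xml_text, local_name):
    """Return a list of dicts, one per element whose local tag name matches."""
    root = ET.fromstring(xml_text)
    facts = []
    for elem in root.iter():
        # ElementTree spells namespaced tags as "{namespace-uri}LocalName",
        # so compare only the part after the closing brace.
        if elem.tag.rsplit("}", 1)[-1] == local_name:
            facts.append({
                "value": (elem.text or "").strip(),
                "contextRef": elem.get("contextRef"),
                "unitRef": elem.get("unitRef"),
                "decimals": elem.get("decimals"),
            })
    return facts


facts = find_facts(SAMPLE, "IncomeTaxExpenseBenefit")
for fact in facts:
    print(fact["value"], fact["unitRef"], fact["contextRef"])
```

Matching on the local name avoids hard-coding the namespace URI, which may differ between filings. If the txt file is not well-formed XML as a whole, `ET.fromstring` will raise a `ParseError`, and the XML portion would need to be cut out first.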

Dropout
  • Could they be `SGML`? This disheartened me a little: http://stackoverflow.com/questions/12412994/use-lxml-to-parse-text-file-with-bad-header-in-python – ExoticBirdsMerchant Apr 23 '14 at 09:11
  • Well, yes: if it's XML then it's also SGML. Check out this article: http://webdesign.about.com/od/sgml/a/how-are-sgml-html-and-xml-related.htm I don't usually work with SGML/XML data structures, so I'm not really qualified to answer most of the advanced questions about them, but I'm sure it's not HTML. – Dropout Apr 23 '14 at 09:18