
I want to parse 10-K files (annual financial statements of firms). An example of Apple's can be found here (look for the .txt file). I was reading this research paper (see pages 30-31) on how to parse these files. Step one is described as removing all ASCII-encoded segments, and that is what I want to figure out how to do.

I see several questions on Stack Overflow about removing non-ASCII characters, but this is different. ASCII-encoded segments are all document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL, and PDF, and I want to delete them.

So if I load a txt file as follows:

    # Read the whole filing into memory
    with open('F:\\file.txt', 'r') as fil:
        x = fil.read()

How can I remove all ASCII-encoded segments from this txt file? To remove HTML tags I use the procedure here, but what about the ASCII-encoded segments?


1 Answer


If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.

I have not taken the time to look it up formally. Perhaps you should.

From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string.
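In Python, that could look something like the sketch below. The file path is a placeholder, and re.DOTALL is needed so that .*? can match across the newlines inside each segment:

    import re

    # Read the filing, then strip every <DOCUMENT> segment whose <TYPE>
    # is GRAPHIC, ZIP, EXCEL or PDF. re.DOTALL lets .*? span many lines.
    with open('F:\\file.txt', 'r') as fil:
        x = fil.read()

    pattern = r'<DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT>'
    x = re.sub(pattern, '', x, flags=re.DOTALL)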

Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.

I first thought this was XBRL, but it's not. Attempting to parse it with ElementTree throws an exception because the closing tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.

Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.
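For illustration only, element removal with lxml XPath might look like the following sketch. It assumes the document has first been coerced into well-formed XML, which, as noted above, the raw filing is not, so treat it as the shape of the approach rather than working code for the actual file:

    from lxml import etree

    # Toy, well-formed stand-in for the real filing (placeholder data)
    xml_bytes = b"""<SEC-DOCUMENT>
      <DOCUMENT><TYPE>10-K</TYPE><TEXT>keep me</TEXT></DOCUMENT>
      <DOCUMENT><TYPE>GRAPHIC</TYPE><TEXT>drop me</TEXT></DOCUMENT>
    </SEC-DOCUMENT>"""

    root = etree.fromstring(xml_bytes)
    # Select the unwanted segments by their TYPE child and detach them
    for doc in root.xpath('//DOCUMENT[TYPE="GRAPHIC" or TYPE="ZIP" '
                          'or TYPE="EXCEL" or TYPE="PDF"]'):
        doc.getparent().remove(doc)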

  • The standard library also contains XML parsers; they can also be useful. – Eric O. Lebigot Nov 05 '14 at 08:06
  • @tripleee Ah, I see. So I should work with the XBRL files instead? The reason I am somewhat hesitant about the XBRL files is that there are years where only txt files are available, for instance http://www.sec.gov/Archives/edgar/data/320193/0001047469-97-006960-index.html. My objective is to grab the section "MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS" in all of these files. I will have to work harder to get this! – Plug4 Nov 05 '14 at 08:31
  • 1
  • You can probably use the `.txt` files, but you need a proper understanding of what format they are in. There are good reasons not to do "hit and run" regex extractions of well-defined XML formats, but if it's a quick one-off, maybe that's what you want to do in the end. However, all things counted, it's not really more work (modulo the learning curve) to do it properly, and the end result will be a lot more understandable, robust, and well-defined. – tripleee Nov 05 '14 at 08:35
  • By looking in more detail into the txt files, I realize that I could extract all the sections I want if I write code that reads line by line and keeps all the text that appears between "Item 7." and "Item 8." (see the sketch after this list). Then on the remaining section I can apply regex extractions and HTML tag stripping. Now, how to do that, I will have to think! Thanks for all the help – Plug4 Nov 05 '14 at 08:58
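A rough sketch of that last idea, with a caveat: real 10-Ks usually also list "Item 7." and "Item 8." in the table of contents, so the first match is often a short false hit. Keeping the longest span is a simple workaround; x is assumed to be the cleaned filing text from above:

    import re

    # Find every span between "Item 7." and "Item 8."; the table of
    # contents typically yields a short first hit, so keep the longest
    # span as the likely MD&A body.
    spans = re.findall(r'Item\s+7\.(.*?)Item\s+8\.', x,
                       flags=re.DOTALL | re.IGNORECASE)
    mdna = max(spans, key=len) if spans else ''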