I want to parse 10-K files (financial statements of firms). Example of Apple's can be found here (look for the .txt file). Now, I was reading this research paper (look on page 30-31) on how to parse these files. The step one is described as removing all ASCII-Encoded segments ... that's what I want to figure out on how to remove them.
I see several questions on StackOverflow on how to remove non-ASCII codes, but this is different. ASCII-Encoded segments are: All document segments with <TYPE>
tags of GRAPHIC
, ZIP
, EXCEL
and PDF
- I want to delete them.
So if I load a txt file as follow:
fil = open('F:\\file.txt','r')
x = fil.read()
How can I remove all ASCII Encoded segments from this txt file? To remove HTML tags, I use the procedure here, but what about ASCII Encoded segments?