How to remove weird encoding from txt file

Question

I am trying to process text files like this one:

http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt

If you look around the middle of the file, there is something like the following:

</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EXCEL
<SEQUENCE>21
<FILENAME>Financial_Report.xlsx
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 Financial_Report.xlsx
M4$L#!!0`!@`(````(0!):[_C#0,``+!)```3``@"6T-O;G1E;G1?5'EP97-=
M+GAM;""B!`(HH``"````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M``````````````````````````````````````#,W,M.VT`4QO%]I;Z#Y6V5
M>([OK@@L>EFV2*4/,+4GQ,(W>08*;]^)N0BA%(2*U/^&B,2>\\6+G[+YSM')
M==\%5V:V[3AL0EFK,#!#/3;M<+X)?YY]795A8)T>&MV-@]F$-\:&)\?OWQV=
MW4S&!O[NP6["G7/3QRBR]<[TVJ['R0S^D^TX]]KY?^?S:-+UA3XW4:Q4'M7C
MX,S@5FY_1GA\]-EL]67G@B_7_NW;)+/I;!A\NKUP/VL3ZFGJVEH[GS2Z&IHG
M4U9W$];^SN4:NVLG^\''"*.#$_:?_'W`W7W?_:.9V\8$IWIVWW3O8T377?1[
MG"]^C>/%^OE##J0<M]NV-LU87_;^":SM-!O=V)TQKN_6R^NZU^UPG_N9^<O%
M-EI>Y(V#[+_?<O`K<\20'`DD1PK)D4%RY)`<!21'"<E107*(H@2AB"H44H5B
MJE!0%8JJ0F%5**X*!5:AR!I39(TILL8466.*K#%%UI@B:TR1-:;(&E-DC2FR
M)A19$XJL"476A")K0I$UH<B:4&1-*+(F%%D3BJPI1=:4(FM*D36ER)I29$TI
MLJ8465.*K"E%UI0B:T:1-:/(FE%DS2BR9A19,XJL&476C")K1I$UH\B:4V3-
M*;+F%%ESBJPY1=:<(FM.D36GR)I39,TILA8460N*K`5%UH(B:T&1M:#(6E!D
M+2BR%A19"XJL)476DB)K29&UI,A:4F0M*;*6%%E+BJPE1=:2(FM%D;6BR%I1
M9*THLE8462N*K!5%UHHB:T61M:+(*HI"JRB*K:(HN(JBZ"J*PJLHBJ^B*,"*
MH@@KBD*L*(RQH#H6QEA.(8O3R.)4LCB=+$XIB]/*XM2R,+TLP12S!-/,$DPU
M2S#=+,&4LP33SA),/4LP_2S!%+0$T]"2_U;1<GX?CHF6O__^`W8YYH6%+-;=
M=,:^\1*%VT-?FKS3LVE^N-EO#GKS`(_/?BZ'WZMS.H^3]1N&9O/ZIW"_0FA_
M]VKR!YG9M>9AB="A93P/$_UVHM</?+(-R.SW'S6F.3`[6O8M'?\!``#__P,`
M4$L#!!0`!@`(````(0"U53`C]0```$P"```+``@"7W)E;',O+G)E;',@H@0"

This looks like an Excel file, or maybe an XBRL document? What is it, and how do I get rid of it (or "process" it somehow)? It goes on for thousands of lines, so I guess it is some encoding of an attached file. Any idea how to deal with it?

I am trying to use BeautifulSoup in Python:

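from bs4 import BeautifulSoup

# Read the filing as text; name the parser explicitly so results are reproducible.
with open("textWithHtml.txt", encoding="utf-8", errors="replace") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# get_text() already returns str, so write it to a text-mode file without .encode().
with open("processedText.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())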

But not everything is removed, and I noticed that in some cases not even all HTML tags are removed. Sometimes running the code twice removes more than the first run did.

pandas provides a way of reading Excel files into a useful structure. See http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel. As you say, you'll have to be careful with the encoding of this piece of the page. – jlb83 Feb 19 '15 at 11:31
Looks like this file format is an XML wrapper around the original file, with some metadata fields and the payload encoded in some 7-bit encoding. – Paulo Scardine Feb 19 '15 at 20:49

score 1 · Answer 1 · edited May 23 '17 at 10:24

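The encoding you are looking at is uuencode. In Python 2 you could decode this blob with the uu module, or simply with stringdata.decode('uu'). In Python 3 the equivalent is codecs.decode(data, 'uu') on a bytes object; the uu module itself was deprecated in Python 3.11 and removed in 3.13.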
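
As a rough illustration, one blob can also be decoded by hand with the standard-library binascii module. This is only a sketch: the helper name decode_uu_lines is hypothetical, and it assumes the lines passed in run from the begin line through end.

```python
import binascii


def decode_uu_lines(lines):
    """Decode the lines of one uuencoded blob into raw bytes.

    Assumption: `lines` contains a single blob, from `begin` through `end`.
    """
    out = bytearray()
    started = False
    for line in lines:
        line = line.rstrip("\r\n")
        if not started:
            # Skip everything up to and including the `begin` header.
            started = line.startswith("begin ")
            continue
        if line == "end":
            break
        if line in ("`", ""):
            # Zero-length terminator line (or a stray blank line): no data.
            continue
        # a2b_uu decodes exactly one uuencoded line.
        out += binascii.a2b_uu(line)
    return bytes(out)
```

For the Financial_Report.xlsx sample in the question, the decoded bytes should begin with the ZIP signature PK, since .xlsx files are ZIP archives.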

uuencode is a legacy format which was originally used to embed binaries in email (which then only permitted 7-bit US-ASCII; the format also has some concessions for interoperability with big-iron systems of the day which used their own bewildering character encodings). These days, you would expect to see base64 in this role.

I posted an answer to the followup question which shows how to remove uuencode blobs while reading from a filehandle or iterating over a bunch of lines of text.

answered Feb 19 '15 at 14:37

tripleee

Thanks tripleee this is very interesting and I was not aware of this uuencode thing. The problem is I have to process hundreds of these files and I do not know how many of these exist in each file. Is it possible that I somehow track these automatically with Python and decode them or whatever? From what I have seen these appear on .xlsx, .zip, .pdf, .jpg, .png etc.. What would I get if I decode these? – adrCoder Feb 19 '15 at 14:47
Decoding a zip file gets you the binary bytes which would be inside of a zip file, ditto ditto for Excel, PDF, etc etc. It's a general-purpose container format so there is no way to predict exactly what it will contain (nor should you rely on the file name extension, even if it's usually a good hint). – tripleee Feb 19 '15 at 14:49
If you omit the `decoded` parameter, the `uu` library will write the decoded content to disk using the file name provided on the `begin` line. Maybe you would prefer that. – tripleee Feb 19 '15 at 14:50
I think it would be best if I get rid of them but is there a way to do so programmatically? – adrCoder Feb 19 '15 at 14:50
The `begin` line is followed by zero or more lines of fixed length, all with the first character `M`. The last data line can be shorter and then has a different prefix, and is followed by two lines, one containing just a single backtick (ASCII 96) and the other containing only `end`. See also the Wikipedia link I added to the answer. – tripleee Feb 19 '15 at 14:51
Yes I see you are right about the M and all that. Any idea how to delete these using Python ? – adrCoder Feb 19 '15 at 14:55
Erm, yes? Write a simple parser as per the spec above. But this is getting out of hand for a comment chain -- post a new question if you can't figure it out. – tripleee Feb 19 '15 at 14:56
Ok I will try and if I can't (very possible :) ) I will post a new question. – adrCoder Feb 19 '15 at 14:57
But does the containing XML format (something XBRL-related?) really even permit anything except a complete uuencoded blob in the `<TEXT>` container in the first place? Probably you can throw away the entire container if it begins with a `begin 644 filename` signature. – tripleee Feb 19 '15 at 14:59
I just checked 3 or 4 different documents and they all begin this section with begin 664 filename -- it can be .pdf or .jpg or .xsl or whatever... so maybe we can delete everything after the begin 664 or something ? – adrCoder Feb 19 '15 at 15:01
I asked a new question please check it out. Thanks – adrCoder Feb 19 '15 at 15:37
http://stackoverflow.com/questions/28610413/python-delete-uuencoding-lines apparently. – tripleee Feb 19 '15 at 15:41
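
The structure described in the comments above (a `begin` line, data lines, a terminator, then `end`) can be turned into a small line filter. This is a sketch only, not the code from the linked answer; the names BEGIN_RE and strip_uuencoded are made up for illustration, and the mode is assumed to be three or four octal digits.

```python
import re

# Assumed header shape: "begin <octal mode> <filename>".
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uuencoded(lines):
    """Yield only the lines that are not part of a uuencoded blob."""
    in_blob = False
    for line in lines:
        if not in_blob:
            if BEGIN_RE.match(line):
                in_blob = True  # drop the header and everything after it
                continue
            yield line
        elif line.rstrip("\r\n") == "end":
            in_blob = False  # drop the `end` line, then resume copying
```

Because it works on any iterable of lines, it can be fed an open file handle directly. If a blob is truncated and has no `end` line, everything after its `begin` line is dropped.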

score 0 · Accepted Answer · edited May 23 '17 at 11:55

The problem can be solved efficiently with the sed command given in this question: "sed command - apply in all text (.txt) files of folder".
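
For comparison, here is a rough Python sketch of the same batch idea, applied to every .txt file in a folder. It is not the sed command from the linked question: it removes only the span from each `begin` line to the matching `end` line. The names UU_BLOB_RE and clean_folder, and the choice to write cleaned copies into a separate folder, are assumptions for illustration.

```python
import re
from pathlib import Path

# From a "begin <mode> <name>" line through the next line that is exactly "end".
UU_BLOB_RE = re.compile(r"^begin [0-7]{3,4} \S.*?^end$\n?", re.MULTILINE | re.DOTALL)


def clean_folder(src_dir, dst_dir):
    """Write a uuencode-free copy of every .txt file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).glob("*.txt")):
        # errors="replace" keeps going if a filing contains stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        target = dst / path.name
        target.write_text(UU_BLOB_RE.sub("", text), encoding="utf-8")
        written.append(target)
    return written
```

A blob with no closing `end` line is left untouched by this pattern, so truncated files need separate handling.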

answered Feb 19 '15 at 20:41

adrCoder

Unfortunately, that removes through to the end of file, not just to the end of the uuencode blob. – tripleee Feb 20 '15 at 05:58
I'm afraid the sample is partial. You would expect the XML structure to continue, at least with a closing tag but more realistically with a large collection of documents, many of which might also be uuencoded. Click through to the EDGAR site to see what these really look like. – tripleee Feb 20 '15 at 09:59
I copied and pasted whatever is there. Maybe the sample is partial because I previously used BeautifulSoup to remove HTML tags, which removed parts of the uuencoding as well? – adrCoder Feb 20 '15 at 10:07
Yes, you should keep the XML structure and use BeautifulSoup to iterate through the tree, not just discard the tree structure. Then you could also simply discard a `<TEXT>` element whose body starts with a `begin 6xx filename` signature, and proceed to the next element. – tripleee Feb 20 '15 at 10:09
Ok, I don't know how to do what you're saying but I will check it out. What I am doing with beautifulsoup is this : http://stackoverflow.com/questions/28608072/python-beautifulsoup-apply-in-every-text-file-in-folder-and-produce-new-text which might be wrong or incomplete or inappropriate. Feel free to suggest some alternative if possible. – adrCoder Feb 20 '15 at 10:10
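
One way to follow the advice above without flattening the whole file first is to work on whole `<DOCUMENT>` blocks. The sketch below swaps BeautifulSoup for plain regular expressions, on the assumption that tags such as `<TYPE>` in these filings are never closed and so are awkward for an HTML parser. The names DOC_RE, UU_TEXT_RE and drop_uuencoded_documents are hypothetical.

```python
import re

# One whole <DOCUMENT>...</DOCUMENT> block, including its trailing newline.
DOC_RE = re.compile(r"<DOCUMENT>.*?</DOCUMENT>\n?", re.DOTALL)
# A <TEXT> body that opens with a uuencode header such as "begin 644 name".
UU_TEXT_RE = re.compile(r"<TEXT>\s*begin [0-7]{3,4} \S")


def drop_uuencoded_documents(filing):
    """Remove every <DOCUMENT> block whose <TEXT> body is a uuencoded blob."""
    def keep_or_drop(match):
        block = match.group(0)
        return "" if UU_TEXT_RE.search(block) else block

    return DOC_RE.sub(keep_or_drop, filing)
```

The remaining blocks keep their tags, so they can still be passed to BeautifulSoup one at a time to extract the text.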
