
I want to process a text file which contains a lot of HTML markup and uuencoded content:

For example, see the .txt file at the following link:

https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794.txt

I am using the following code:

from bs4 import BeautifulSoup

def strip_non_ascii(string):
    ''' Returns the string without non-ASCII characters '''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    lines = f.readlines()
    with open("PROCESSED.txt", 'w', encoding='utf-8') as f1:
        i=1
        for line in lines:
            soup = BeautifulSoup(line, "lxml")
            print(i, "Initial line: ", line)
            print(i, "Soup get text line: ", soup.get_text())
            bs_line = soup.get_text()
            ascii_line = strip_non_ascii(bs_line)
            print(i, "Ascii line: ", ascii_line)
            f1.write(ascii_line)
            i=i+1



which reduces the file from 8.5 MB to 2.5 MB, but it still has a lot of elements I do not need, such as:

</tr>
<tr style="vertical-align: bottom; background-color: #cceeff;">
<td
style="padding: 0px 0px 0px 10pt; text-indent: -10pt;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>

And

EXCEL
86
Financial_Report.xlsx
IDEA: XBRL DOCUMENT

begin 644 Financial_Report.xlsx
M4$L#!!0    (  J%=D@6'2-4(0(  $8I   3    6T-O;G1E;G1?5'EP97-=
M+GAM;,W:2V[;,! &X*L8VA86S62DZ(U
MW")I8^#?6):'G!EII&_EJV\/@=+BX(8QK:LNY_"!L=1TY&RJ?:"Q1#8^.IO+
M:=RR8)N=W1(3JY5AC1\SC7F9IQS5]=67/<78M[3X> Q,N=>5#6'H&YM[/[+]
MV)YD7?K-IF^H]M31U1=D.=\L- Z5S]8^2I\@UM[-V07U3X\=[5D89Y3>KZ\%CJTZ%D2>6W=56B
MZ5D53C?^K;/>34,+X_:W'=/Y/U[+R4WM[KY[OWO-QX2FJVJI7898%L;M([5?MWH+T\0ZDC_M D56@2*K0)%5H,@J4&05*+(*%%D%BJP215:)(JM$D56BR"I19)4HLDH4626*
MK!)%5HDBJT*15:'(JE!D52BR*A19%8JL"D56A2*K0I%5HMBJP:15:-(JM&D56CR*I19-4HLAH460V*K 9%5H,BJT&1U:#(:E!D-2BR&A19

Is there a way to remove these and keep only the relevant textual information included in the text file?

EDIT: From the link I provided, one example of text I would like to keep is:

<P STYLE="font: 10pt/normal Times New Roman,serif; margin: 0; text-align: justify">The table above indicates the current yields
to maturity (YTM) for the senior bonds of selected life insurance carriers with durations, on average, that our similar to our
life insurance portfolio.&nbsp; The average yield to maturity of these bonds was 3.02% which, we believe, reflects in part the
financial market&rsquo;s judgement that credit risk is low with regard to these carriers&rsquo; financial obligations. It should
be noted that the obligations of life insurance carriers to pay life insurance policy benefits is senior in rank to any other obligation.&nbsp;
This &ldquo;super senior&rdquo; priority is not reflected in the yield to maturity in the table and, if considered, would result
in a lower yield to maturity all else being equal. As such, as long as the respective premium payments have been made, it is highly
likely that the owner of the insurance policy will collect the insurance policy benefit upon the mortality of the insured.</P>

That is, I would like to remove all the HTML tags and the uuencoded binary, and keep only the text.

EDIT 2:

Gerrit's response below is definitely very close to what I want to achieve, for the .txt file under consideration at least. But still, it leaves the following part at the end of the file:

Actuarial Pricing Systems, LP Model Actuarial
Pricing Systems, LP  33(Q7.U=JG''<]S7/R,ZG4BCJ0V3TKG/'&I;?V=X:N-K;9;C]RA^O4_EFG
M:==/<^*KESYJ(^GP2")\_*26SQV-%M9T2^ER$N(E=_96.&'X J:]=&,<=*\L\2V
MWB>ZTU9M7LH$M[;D-$5!4'CL3QTKH]*\07E[I&CVUFT;(NYU=))9E+!!&!G@$
M9)RO?O6N(3G%3OKL88:2IRE"SMNCL=X]*7--R3Z'/J"VI>Y=WC\L,/)7RB<9
MSR?>CD8O:)['4%@!D\#UKE_'K!O",S @CS(R#G_:%5AKUS=23VDLUO<03V<[
MI)#"Z!2HZ MPXP>HJ'Q!@?#*UQ_SRM^G_ :TI1:J1OW1E6FG3E;LS)70=)?X
M>KJDR>7>>4S"7>?F8,0!CH<]*W_AU<3/X==)22D<[)%GLN G1LGELK%Y,Q;NN>.3R>^>
MU;5IIIPO=W,*$&I1G;:RM]YUV[C.* V:YBPU'4KQX[33Q:0I;6L#R"56;<77(
M5<'@ #KS4"ZI=P2R0V,5LDL^JR6Y+[B.$SN//7CH./I7+RL[>=;G79J.;?Y;
M>7]_:=OUQQ7+2>([Z&W6;*>2TAG%Z]I)<%28U"KNW;_N9M#%]J
M6U6(=SLC*@("<'!YY S^-)Q:!33T/-/#;Z,NHW2>)(B97; >7.U7R=V[N#[F
MO3=$TO3]+LW73&+6\SF4'?O'(QP?3BN?U:#PGX@L)+\7=M%-LW"='VOTXW+W
M^A%8O@W4;^QT'5)X4\R"V>.0HP) &?W@'OMYKLJIU(.:NMM'M\CBHVI3Y&D;]
M]5^IZAFDWO7[&]EL8U>TAFCA$PC:0J-NYWV@Y8#*C ]Z;%>ZC=Z_I
MC07]M);2V;2OY<;;'PR@D#/7GCTYZURM1VNNZ[=)IKJM@HU'>D8*M^Z*Y.X\\\ \<N>*ZN*020QOO1M
MR@[D/!]Q[5+BUN4I)G@##YC]:3%.;[Q^M)7T9\M<3%:WAO\ Y#D'T;^1K*K4
M\.?\AR#Z-_(UE7_AR-B23R64-RK3D%]RNW0D\9Z=33+?1-'MM9?5HK>X%TQ9B2C[
M06ZD#'^/\ 9%.TO6[IY[:QDMY9B$03W&[.
M&9-^>@&.0*J\M7W8NRTI8_P"BQ(%V,N\*#N'S @L,@_A4
M=QX@U18YW2TAAV6;3[)F.X,'*YZ=.,]J/>[A[BOH6;K1-)NWN#(EZ([EM\T2
M&14=O[Q4=^E33Z=837+W*-?V\LBA9&MS;(GF <#=CJ??K5.;Q#=V=U/$8/M,K
M2X2)"<*!$C, 0.>3QFGR^)+M))!'IZLB-(H+SX/[M0S9&..#^='O#O THH;.
M&^>\1+CSWB6%BR.:J0Z1I4.MR:NL5R;Q\Y9E<@9&#

which seems to be the uuencoded binary part. Any idea how to get rid of this?
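One observation that may help answerers: the uuencode format itself makes these blocks easy to identify mechanically, since each payload runs from a `begin <mode> <filename>` line to a lone `end` line. A sketch of a line filter along those lines (not tested against the full filing) would be:

```python
import re

def drop_uuencoded_blocks(text):
    """Remove uuencoded payloads, which run from a 'begin <mode> <file>'
    line down to the terminating 'end' line."""
    out_lines = []
    in_block = False
    for line in text.splitlines():
        # A uuencode header looks like: begin 644 Financial_Report.xlsx
        if not in_block and re.match(r'^begin \d{3} ', line):
            in_block = True
            continue
        if in_block:
            if line.strip() == 'end':
                in_block = False
            continue
        out_lines.append(line)
    return '\n'.join(out_lines)
```

Running this over the raw file before handing it to BeautifulSoup should leave only the HTML for soup to deal with.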

adrCoder
  • 3,145
  • 4
  • 31
  • 56
  • The idea of BeautifulSoup is that you filter only those elements that you need. For example: `for div in soup.findAll(["td"])` Let soup do the work. But if the filesize is a problem you could convert the file before you process it. During conversion you could filter out the elements that you don't need – Gerrit Verhaar Nov 30 '16 at 15:19
  • Sorry, could you be a bit more precise? Could you write some code that would allow me to get rid of the HTML tags (such as the `td` ones) and the uuencoded .xlsx blob, while keeping the text? – adrCoder Dec 01 '16 at 11:25
  • could you give a sample of the text that you want to keep (including the html around it)? Is it contained in specific tags, like `<p>`, or in a specific section, for example identified with an id or class? – Gerrit Verhaar Dec 01 '16 at 13:01
  • sample if text is in `<p>`: `soup = BeautifulSoup(open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt"),"html5lib") for p in soup.findAll(["p"]): labelName = p.get_text()` Soup parses the html, so that you can use the html structure to find the elements you need – Gerrit Verhaar Dec 01 '16 at 13:08
  • Hi Gerrit, I edited my initial post and put an example. Hope it helps; I am looking for a way to delete from the file the HTML, the ASCII-encoded binary, and any other embedded document structures that are not intended to be analysed. The whole .txt file can be found via the link in my initial post, by the way. Thanks. – adrCoder Dec 01 '16 at 14:56
  • You need to build a filter on that p.get_text. Have a look at http://stackoverflow.com/questions/28610413/python-delete-uuencoding-lines/28622648#28622648 I did a quick test but couldn't get it to work with the sample you gave. – Gerrit Verhaar Dec 02 '16 at 11:44

2 Answers


Instead of filtering out the unwanted text I would use soup to select the text you really need. If this text is contained in the <p> tags then:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

only_p_tags = SoupStrainer("p")

soup = BeautifulSoup(open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt"), "html.parser", parse_only=only_p_tags)

for p in soup:
    print(p.get_text())
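To also deal with the leftover block mentioned in the comments, and assuming it really is a uuencoded payload delimited by `begin 644 …` / `end` lines, one possible extension (a sketch, not part of the original answer) is to cut those payloads out of the raw text before parsing:

```python
import re

def strip_uuencode(raw):
    """Cut uuencoded payloads (from a 'begin <mode> <name>' line
    to the closing 'end' line) out of the raw filing text."""
    # (?ms): ^/$ match at line boundaries, and '.' also matches newlines,
    # so .+? spans the whole block between the begin and end markers.
    return re.sub(r'(?ms)^begin \d{3} .+?^end$\n?', '', raw)
```

The cleaned string can then be passed to `BeautifulSoup(strip_uuencode(raw), "html.parser", parse_only=only_p_tags)` exactly as above.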
Gerrit Verhaar
  • Hi Gerrit. Please see the edit in my original post. I upvoted your answer because it definitely produces something close to what I want, but it still leaves the binary uuencoding part at the end. Please see my edited post. Thank you for your help! – adrCoder Dec 02 '16 at 10:50

When examining an SEC filing, keep in mind that it is composed of header data plus one or more files. Those files can be of many types: HTML, PDF, TXT, JPG, GIF, ZIP, etc.

Because file types such as JPG and GIF normally contain "non-printable" characters, they are uuencoded in the submission and must be decoded to return each file to its proper state for normal use.

For your example filing, the Filing Details page (https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794-index.html) shows there are eight HTML pages, two graphics (JPG), and XML and XSD files. If you need to use the raw "Accession-Number.txt" file that is the complete submission, you must parse out the individual files and perform the uudecode as part of the process.
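As a rough sketch of that parsing step (assuming the complete submission wraps each file in `<DOCUMENT>`…`</DOCUMENT>` sections with a `<TYPE>` line, as EDGAR full-submission files generally do; the type list in `keep_textual` is illustrative):

```python
import re

def split_documents(full_submission):
    """Split an EDGAR full-submission text file into its component
    <DOCUMENT>...</DOCUMENT> sections, returning (type, body) pairs."""
    docs = []
    for block in re.findall(r'<DOCUMENT>(.*?)</DOCUMENT>', full_submission, re.S):
        type_match = re.search(r'<TYPE>([^\s<]+)', block)
        doc_type = type_match.group(1) if type_match else 'UNKNOWN'
        docs.append((doc_type, block))
    return docs

def keep_textual(docs, wanted=('10-K', 'EX-', 'TXT', 'HTM')):
    """Keep only document types that carry readable text; uuencoded
    binaries (GRAPHIC, EXCEL, ZIP, ...) are simply skipped."""
    return [(t, b) for t, b in docs if any(t.startswith(w) for w in wanted)]
```

The kept bodies can then be fed to BeautifulSoup individually, with no uuencoded material left to filter out.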

Krazick