I want to process a text file that contains a lot of HTML and uuencoded data.
For example, see the .txt file at the following link:
https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794.txt
I am using the following code:
from bs4 import BeautifulSoup

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

# Read the raw filing line by line
with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    lines = f.readlines()

# Strip the markup from each line and write the ASCII-only result
with open("PROCESSED.txt", 'w', encoding='utf-8') as f1:
    i = 1
    for line in lines:
        soup = BeautifulSoup(line, "lxml")
        print(i, "Initial line: ", line)
        print(i, "Soup get text line: ", soup.get_text())
        bs_line = soup.get_text()
        ascii_line = strip_non_ascii(bs_line)
        print(i, "Ascii line: ", ascii_line)
        f1.write(ascii_line)
        i = i + 1
# No explicit close() calls are needed: the with blocks close both files.
which reduces the file from 8.5 MB to 2.5 MB, but it still has a lot of elements I do not need, such as:
</tr>
<tr style="vertical-align: bottom; background-color: #cceeff;">
<td
style="padding: 0px 0px 0px 10pt; text-indent: -10pt;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;"> </font></td>
And
EXCEL
86
Financial_Report.xlsx
IDEA: XBRL DOCUMENT
begin 644 Financial_Report.xlsx
M4$L#!!0 ( J%=D@6'2-4(0( $8I 3 6T-O;G1E;G1?5'EP97-=
M+GAM;,W:2V[;,! &X*L8VA86S62DZ(U
MW")I8^#?6):'G!EII&_EJV\/@=+BX(8QK:LNY_"!L=1TY&RJ?:"Q1#8^.IO+
M:=RR8)N=W1(3JY5AC1\SC7F9IQS5]=67/<78M[3X> Q,N=>5#6'H&YM[/[+]
MV)YD7?K-IF^H]M31U1=D.=\L- Z5S]8^2I\@UM[-V07U3X\=[5D89Y3>KZ\%CJTZ%D2>6W=56B
MZ5D53C?^K;/>34,+X_:W'=/Y/U[+R4WM[KY[OWO-QX2FJVJI7898%L;M([5?MWH+T\0ZDC_M D56@2*K0)%5H,@J4&05*+(*%%D%BJP215:)(JM$D56BR"I19)4HLDH4626*
MK!)%5HDBJT*15:'(JE!D52BR*A19%8JL"D56A2*K0I%5HMBJP:15:-(JM&D56CR*I19-4HLAH460V*K 9%5H,BJT&1U:#(:E!D-2BR&A19
Is there a way to remove these and keep only the relevant textual information included in the text file?
EDIT: From the link I provided, one example of text I would like to keep is:
<P STYLE="font: 10pt/normal Times New Roman,serif; margin: 0; text-align: justify">The table above indicates the current yields
to maturity (YTM) for the senior bonds of selected life insurance carriers with durations, on average, that our similar to our
life insurance portfolio. The average yield to maturity of these bonds was 3.02% which, we believe, reflects in part the
financial market’s judgement that credit risk is low with regard to these carriers’ financial obligations. It should
be noted that the obligations of life insurance carriers to pay life insurance policy benefits is senior in rank to any other obligation.
This “super senior” priority is not reflected in the yield to maturity in the table and, if considered, would result
in a lower yield to maturity all else being equal. As such, as long as the respective premium payments have been made, it is highly
likely that the owner of the insurance policy will collect the insurance policy benefit upon the mortality of the insured.</P>
That is, I would like to remove all the HTML tags and the uuencoded binary data, and keep only the text.
EDIT 2:
Gerrit's response below is definitely very close to what I want to achieve, for the .txt file under consideration at least. But still, it leaves the following part at the end of the file:
Actuarial Pricing Systems, LP Model Actuarial
Pricing Systems, LP 33(Q7.U=JG''<]S7/R,ZG4BCJ0V3TKG/'&I;?V=X:N-K;9;C]RA^O4_EFG
M:==/<^*KESYJ(^GP2")\_*26SQV-%M9T2^ER$N(E=_96.&'X J:]=&,<=*\L\2V
MWB>ZTU9M7LH$M[;D-$5!4'CL3QTKH]*\07E[I&CVUFT;(NYU=))9E+!!&!G@$
M9)RO?O6N(3G%3OKL88:2IRE"SMNCL=X]*7--R3Z'/J"VI>Y=WC\L,/)7RB<9
MSR?>CD8O:)['4%@!D\#UKE_'K!O",S @CS(R#G_:%5AKUS=23VDLUO<03V<[
MI)#"Z!2HZ MPXP>HJ'Q!@?#*UQ_SRM^G_ :TI1:J1OW1E6FG3E;LS)70=)?X
M>KJDR>7>>4S"7>?F8,0!CH<]*W_AU<3/X==)22D<[)%GLN G1LGELK%Y,Q;NN>.3R>^>
MU;5IIIPO=W,*$&I1G;:RM]YUV[C.* V:YBPU'4KQX[33Q:0I;6L#R"56;<77(
M5<'@ #KS4"ZI=P2R0V,5LDL^JR6Y+[B.$SN//7CH./I7+RL[>=;G79J.;?Y;
M>7]_:=OUQQ7+2>([Z&W6;*>2TAG%Z]I)<%28U"KNW;_N9M#%]J
M6U6(=SLC*@("<'!YY S^-)Q:!33T/-/#;Z,NHW2>)(B97; >7.U7R=V[N#[F
MO3=$TO3]+LW73&+6\SF4'?O'(QP?3BN?U:#PGX@L)+\7=M%-LW"='VOTXW+W
M^A%8O@W4;^QT'5)X4\R"V>.0HP) &?W@'OMYKLJIU(.:NMM'M\CBHVI3Y&D;]
M]5^IZAFDWO7[&]EL8U>TAFCA$PC:0J-NYWV@Y8#*C ]Z;%>ZC=Z_I
MC07]M);2V;2OY<;;'PR@D#/7GCTYZURM1VNNZ[=)IKJM@HU'>D8*M^Z*Y.X\\\ \<N>*ZN*020QOO1M
MR@[D/!]Q[5+BUN4I)G@##YC]:3%.;[Q^M)7T9\M<3%:WAO\ Y#D'T;^1K*K4
M\.?\AR#Z-_(UE7_AR-B23R64-RK3D%]RNW0D\9Z=33+?1-'MM9?5HK>X%TQ9B2C[
M06ZD#'^/\ 9%.TO6[IY[:QDMY9B$03W&[.
M&9-^>@&.0*J\M7W8NRTI8_P"BQ(%V,N\*#N'S @L,@_A4
M=QX@U18YW2TAAV6;3[)F.X,'*YZ=.,]J/>[A[BOH6;K1-)NWN#(EZ([EM\T2
M&14=O[Q4=^E33Z=837+W*-?V\LBA9&MS;(GF <#=CJ??K5.;Q#=V=U/$8/M,K
M2X2)"<*!$C, 0.>3QFGR^)+M))!'IZLB-(H+SX/[M0S9&..#^='O#O THH;.
M&^>\1+CSWB6%BR.:J0Z1I4.MR:NL5R;Q\Y9E<@9&#
which seems to be part of the uuencoded binary data. Any idea how to get rid of it?
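One possible way to drop such blocks is to remove them from the raw text before any HTML parsing. The sketch below is only an illustration: `strip_uuencoded` is a hypothetical helper name, and it assumes that every encoded block starts with a `begin <mode> <filename>` line (as in the `begin 644 Financial_Report.xlsx` example above) and finishes with a line containing only `end`. Whether every block in this particular filing follows that layout has not been checked.

```python
import re

# Matches the header line of a uuencoded block, e.g. "begin 644 Financial_Report.xlsx"
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")

def strip_uuencoded(text):
    """Return text with every 'begin ... end' uuencoded block removed.

    Assumption: each block is closed by a line containing only 'end'.
    If the closing line is missing, everything after 'begin' is dropped.
    """
    kept = []
    inside = False
    for line in text.splitlines(keepends=True):
        if not inside and BEGIN_RE.match(line):
            inside = True          # header line: start skipping
            continue
        if inside:
            if line.strip() == "end":
                inside = False     # trailer line: stop skipping
            continue
        kept.append(line)
    return "".join(kept)
```

The function would be applied to the whole file contents (for example the result of `f.read()`) before the text is passed to BeautifulSoup, so that the parser never sees the encoded payload.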
or in a specific section, for example identified with an id or class?
– Gerrit Verhaar Dec 01 '16 at 13:01: `soup = BeautifulSoup(open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt"), "html5lib")` followed by `for p in soup.findAll(["p"]): labelName = p.get_text()`. BeautifulSoup parses the HTML, so you can use the HTML structure to find the elements you need.
– Gerrit Verhaar Dec 01 '16 at 13:08
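The idea in the comment above, parsing the document as a whole and then picking elements by tag, might look roughly like the following. This is a sketch, not the commenter's exact code: `paragraph_texts` is a hypothetical helper name, the built-in `"html.parser"` backend is used instead of `html5lib` only to avoid an extra dependency, and the whitespace collapsing is an added assumption about the desired output.

```python
from bs4 import BeautifulSoup

def paragraph_texts(html):
    """Return the text of every <p> element, with whitespace collapsed."""
    # Parse the whole document at once so tags that span several lines stay intact
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        # Join on single spaces to undo the hard line breaks inside paragraphs
        text = " ".join(p.get_text().split())
        if text:                      # skip paragraphs that are only spacing
            texts.append(text)
    return texts
```

Parsing the full contents of the file, rather than one line at a time, matters here because a tag such as `<td` followed by its `style=` attribute on the next line cannot be recognised when each line is parsed separately.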