I downloaded 13 000 files (10-K reports from different companies) and I need to extract a specific part of these files (section 1A- Risk factors). The problem is that I can open these files in Word easily and they are perfect, while as I open them in a normal txt editor, the document appear to be an HTML with tons of encrypted string in the end (EDIT: I suspect this is due to XBRL format of these files). Same happens as a result of using BeautifulSoup.
I've tried using online decoder, because I thought that maybe this is connected to Base64 encoding, but it seems that none of the known encoding could help me. I saw that at the beginning of some files, there is something like: "created with Certent Disclosure Management 6.31.0.1" and other programs, I thought maybe this causes the encoding. Nevertheless Word is able to open these files, so I guess there must be a known key to it. This is a sample encoded data:
M1G2RBE@MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9@*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/
And a sample file from the 13 000 that I downloaded.
Below I insert the BeautifulSoup that I use to extract text. It does its' job, but I need to find a clue to this encoded string and somehow decode it in the Python code below.
from bs4 import BeautifulSoup
with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')
print(soup.getText())
with open("extracted_test.txt", "w", encoding="utf-8") as f:
f.write(soup.getText())
f.close()
What I want to achieve is decoding of this dummy string in the end of the file.