
I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of each file (section 1A, Risk Factors). The problem is that I can open these files in Word easily and they look fine, but when I open them in a plain text editor, the document appears to be HTML with tons of encoded strings at the end (EDIT: I suspect this is due to the XBRL format of these files). The same thing happens when I process them with BeautifulSoup.

I've tried using online decoders, because I thought this might be connected to Base64 encoding, but it seems that none of the known encodings could help me. I saw that at the beginning of some files there is something like "created with Certent Disclosure Management 6.31.0.1" or the name of another program, so I thought that might be the cause of the encoding. Nevertheless, Word is able to open these files, so I guess there must be a known key to it. This is a sample of the encoded data:

M1G2RBE@MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9@*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/

And here is a sample file from the 13,000 that I downloaded.

Below is the BeautifulSoup code that I use to extract the text. It does its job, but I need to find a clue to this encoded string and somehow decode it in the Python code below.

from bs4 import BeautifulSoup

# read the downloaded filing and parse it as HTML
with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
print(soup.get_text())

# write the extracted text to a separate file
# (a with-block closes the file itself; no explicit close() needed)
with open("extracted_test.txt", "w", encoding="utf-8") as out:
    out.write(soup.get_text())

What I want to achieve is decoding this dummy string at the end of the file.
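For what it's worth, the sample above (lines starting with `M`, plus the `begin 644 Financial_Report.xlsx` marker mentioned in the comments below) is characteristic of uuencoding: EDGAR full-submission text files append binary attachments as uuencoded blocks. As a rough sketch (the helper names are mine, and it assumes the block starts at a line beginning with `begin 644`), the block can either be cut off before parsing with BeautifulSoup, or its data lines decoded back to binary with the standard library:

```python
import binascii

def strip_uuencoded(text):
    """Return the document text with any trailing uuencoded block removed."""
    head, _sep, _tail = text.partition("\nbegin 644 ")
    return head

def decode_uu_block(data_lines):
    """Decode the data lines of a uuencoded block (between 'begin' and 'end')."""
    return b"".join(binascii.a2b_uu(line) for line in data_lines)
```

Since only the HTML part matters for extracting Risk Factors, cutting the block off with `strip_uuencoded` before handing the text to BeautifulSoup should be enough; decoding would only matter if the embedded `.xlsx` itself were needed.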

  • Using your sample file, for example, what exactly is it you are trying to do that works with Word but not with BS4 or a text editor? – Jack Fleeting Jul 31 '19 at 10:10
  • My goal is to get a pure txt file out of these downloaded files and to extract a specific part of it using regex. But as I use BS4 I get a lot of encoded string (Word is somehow able to decode it). I uploaded a sample file [here](https://gofile.io/?c=SKrlf6). I need to decode this data before using BS4 and I suppose these are some xlsx files that are added after an HTML structure, because this data starts with this part: `Financial_Report.xlsx IDEA: XBRL DOCUMENT begin 644 Financial_Report.xlsx` – Karolina Andruszkiewicz Jul 31 '19 at 11:46
  • But ultimately you are only interested in "ITEM 1A. Risk Factors", right? – Jack Fleeting Jul 31 '19 at 11:56
  • Exactly, only "ITEM 1A. Risk Factors", I was planning to extract a plain txt files out of the downloaded ones, then use regular expressions to get only this part. – Karolina Andruszkiewicz Jul 31 '19 at 12:00
  • The 10k link you provided in your question (at sec.report) has a "Not Applicable" under Risk Factors. Can you add another link to the same site that has an actual Risk Factors section? – Jack Fleeting Jul 31 '19 at 12:07
  • Sure, [here](https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm) goes another, that has it – Karolina Andruszkiewicz Jul 31 '19 at 12:26

1 Answer


Ok, this is going to be somewhat messy, but it will get you close enough to what you are looking for, without using regex (which is notoriously problematic with HTML). The fundamental problem you'll face is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work with a similar filing (even from the same filer...). For example, the word 'item' may appear in lowercase, uppercase, or mixed case, hence the use of the string.lower() method, etc. So there's going to be some cleanup, under all circumstances.

Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):

import requests
from bs4 import BeautifulSoup as bs

url = '...'  # one of the two filing URLs mentioned above

response = requests.get(url)
soup = bs(response.content, 'html.parser')

# find the table-of-contents anchor pointing at Item 1A, then walk
# forward through the document until the next 'item' anchor
risks = soup.find_all('a')
for risk in risks:
    if 'item' in str(risk.attrs).lower() and '1a' in str(risk.attrs).lower():
        for i in risk.find_all_next():
            if 'item' in str(i.attrs).lower():
                break
            else:
                print(i.text.strip())
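If the extracted section should end up in a file for the later regex step rather than being printed, the loop above can be wrapped in a function that collects the text first. This is just a variant sketch of the same approach (the function name and output filename are mine, and it deliberately stops after the first matching anchor):

```python
from bs4 import BeautifulSoup

def extract_item_1a(soup):
    """Collect the text between the 'Item 1A' anchor and the next 'item' anchor."""
    parts = []
    for anchor in soup.find_all('a'):
        attrs = str(anchor.attrs).lower()
        if 'item' in attrs and '1a' in attrs:
            for node in anchor.find_all_next():
                if 'item' in str(node.attrs).lower():
                    break
                parts.append(node.get_text(strip=True))
            break  # stop after the first matching anchor
    return '\n'.join(parts)
```

With the `soup` object from the code above, the result can then be saved with, e.g., `open('risk_factors.txt', 'w', encoding='utf-8').write(extract_item_1a(soup))`.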

Good luck with your project!

Jack Fleeting
  • Thanks a lot for your answer! It is a much better idea to download only "1A. Risk Factors" directly from the url rather than downloading all the files and then trying to extract this part. Could you please have a look at the EDIT that I added to my post? I am still fighting with 2 issues. – Karolina Andruszkiewicz Aug 01 '19 at 11:31
  • @KarolinaAndruszkiewicz - I'll take a look at it later but meanwhile, 2 points: the script (or any script, for that matter) doesn't download only the Risk Factors section of the filing; it isn't possible on EDGAR (or anywhere else, I believe). You download the whole filing locally, and then process it with the script. Second, you may want to take the edit (and remaining issues) and post it as a new question (while reverting the current question to its previous status); SO frowns upon multiple issues in one question, and having a new question may give more people an opportunity to give input. – Jack Fleeting Aug 01 '19 at 11:45
  • Thanks for the remarks! Here I posted these issues as a new question: [link](https://stackoverflow.com/questions/57308830/extraction-of-text-using-beautiful-soup-and-regular-expressions) – Karolina Andruszkiewicz Aug 01 '19 at 12:17