
I am trying to save the output of parser.feed to a string for further parsing, but parser.feed returns None.

Here is my code:

import requests
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        return("Encountered some data  : ", data.encode('utf-8'))

list_of_10K_text_files = ['https://www.sec.gov/Archives/edgar/data/200406/000020040616000071/0000200406-16-000071.txt', 
                      'https://www.sec.gov/Archives/edgar/data/40545/000004054516000145/0000040545-16-000145.txt', 
                      'https://www.sec.gov/Archives/edgar/data/1095130/000161577416007303/0001615774-16-007303.txt']

page = requests.get(list_of_10K_text_files[0])

parser = MyHTMLParser()

pos_Large_Acc_filer = (page.text).find('Large accelerated filer')
pos_Small_Reporting_Co = (page.text).find('Smaller reporting company')

# I would like to save the results of parser.feed to "text_for_file"
# as a string for further parsing
text_for_file = parser.feed(page.text[pos_Large_Acc_filer:(pos_Small_Reporting_Co+150)])

# Output Desired in the text_for_file variable
---------------------------------------------------------------------------
Encountered some data  :  b'Large accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'\xc3\xbe'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Non-accelerated filer\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'
Encountered some data  :  b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0Smaller reporting company\xc2\xa0\xc2\xa0'
Encountered some data  :  b'o'

Currently parser.feed returns None, but I need the output shown above in a form that I can parse further.

EDIT: In case you are wondering why I am parsing .txt files, below is an example of the text they contain. Apart from the first 50 or so lines of header information (which I have not included), it is clearly HTML.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <!-- Document created using Wdesk 1 -->
    <!-- Copyright 2016 Workiva -->
    <title>10-K</title>
</head>
    <body style="font-family:Times New Roman;font-size:10pt;">
        <a name="s5971963f20334f9f9b208ef25f6cc9cd"></a>
        <div style="line-height:120%;padding-top:2px;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">UNITED STATES</font>
        </div>
        <div style="line-height:120%;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">SECURITIES AND EXCHANGE COMMISSION</font>
        </div>
        <div style="line-height:120%;text-align:center;font-size:12pt;">
            <font style="font-family:inherit;font-size:12pt;font-weight:bold;">Washington,&#160;D.C. 20549</font>
        </div> 

EDIT:

The source code for the parser can be found at the following link: HTML.parser Source Code

The feed function starts at line 158. feed calls self.goahead(0) and does not return its result. The goahead function starts at line 193.

goahead sometimes calls handle_data (source code starts at line 534), but the base-class handle_data is just pass, and goahead ignores whatever handle_data returns. This seems odd, but it might be the cause of my particular problem.
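To make the behaviour described above concrete, here is a minimal sketch. The class name ReturningParser and the sample HTML string are made up for illustration. It suggests that a value returned from handle_data never reaches the caller of feed:

```python
from html.parser import HTMLParser


class ReturningParser(HTMLParser):  # hypothetical name, for illustration only
    def handle_data(self, data):
        # This value is handed back to HTMLParser's internal loop,
        # which discards it; it never propagates out of feed().
        return ("Encountered some data  : ", data)


returning_parser = ReturningParser()
feed_result = returning_parser.feed("<p>Large accelerated filer</p>")
print(feed_result)  # feed() has no return statement, so this is None
```

So any result has to leave handle_data through a side effect (printing, appending to an attribute, and so on) rather than through its return value.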

mkultra
  • How can I edit the `def handle_data(self, data):` so that the `parser.feed` does not return None? – mkultra Sep 22 '16 at 05:34
  • I don't get it: you're downloading text files and expect some HTML output? Can you try on a real html input? – Jean-François Fabre Sep 22 '16 at 06:15
  • OK, clearly I need to do a better job of explaining my problem. I will update tomorrow. As for the text files I am parsing, they are written in HTML, so the HTML parser works on them. Why the SEC provides them as .txt files is beyond me. – mkultra Sep 22 '16 at 06:20
  • I'm not giving this up :) – Jean-François Fabre Sep 22 '16 at 16:45
  • Thank you I appreciate it. Would you like me to upload one of the .txt files through Gdrive or AWS S3? Or are you able to access the hyperlinks from the list in the code? – mkultra Sep 22 '16 at 16:46

1 Answer

First, I would like to thank @Jean-François Fabre for his work in helping me explain and frame my question better, as well as for his work on this problem thus far.

It turns out that one solution to my problem (found here: @WillTownes-StackOverflow) is to redirect stdout to a file, like so:

import sys

# Assumes handle_data calls print(...) instead of returning a value
temp = sys.stdout                                                             # store original stdout object for later
sys.stdout = open("Form_10K_Data.txt", "w+")                                  # redirect all prints to this log file
parser.feed(page.text[pos_Large_Acc_filer:(pos_Small_Reporting_Co+150)])      # nothing appears on screen; it is written to the log file instead
sys.stdout.close()                                                            # ordinary file object
sys.stdout = temp                                                             # restore print commands to interactive prompt

with open("Form_10K_Data.txt") as f:
    filer_file = f.read().split('\n')[:-1]
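If the print-based approach is kept, the temporary file can probably be avoided. The sketch below uses contextlib.redirect_stdout with an in-memory io.StringIO buffer. PrintingParser and sample_html are stand-ins I made up for the real parser and the downloaded filing:

```python
import contextlib
import io
from html.parser import HTMLParser


class PrintingParser(HTMLParser):  # hypothetical stand-in for MyHTMLParser
    def handle_data(self, data):
        print("Encountered some data  : ", data)


# Made-up sample input; the real code would pass a slice of page.text.
sample_html = "<div>Large accelerated filer</div><div>Smaller reporting company</div>"

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # Everything printed inside this block goes to buffer, not the console.
    printing_parser = PrintingParser()
    printing_parser.feed(sample_html)
    printing_parser.close()

# stdout is restored automatically when the with block exits.
captured_lines = buffer.getvalue().splitlines()
```

This still relies on capturing printed text, so it shares the same basic drawback as the file redirect. It only removes the file on disk and the manual restore of sys.stdout.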

However, this feels hacky. Is there a more pythonic solution to this problem?
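One option that avoids stdout entirely is to have the parser store each chunk on the instance and read it back after calling feed. This is only a sketch. CollectingParser, its chunks attribute, and sample_html are names and data I made up, not part of html.parser:

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.chunks = []  # text nodes seen so far, in document order

    def handle_data(self, data):
        # Store the text instead of printing or returning it.
        self.chunks.append(data)


# Made-up sample input; the real code would pass a slice of page.text.
sample_html = "<font>Large accelerated filer&#160;&#160;</font><font>&#254;</font>"

collecting_parser = CollectingParser()
collecting_parser.feed(sample_html)
collecting_parser.close()  # flush any text still buffered by the parser

# Join the chunks into one string for further parsing.
text_for_file = "".join(collecting_parser.chunks)
```

The chunks stay as ordinary str objects here. Calling .encode('utf-8') on them, as in the question, would produce bytes, which is usually less convenient for further string parsing.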

mkultra