
I use the following:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

to get rid of the HTML tags found in a text. However, for one of my files, when I do:

fdir = open('0001005214-12-000007.txt')
text = fdir.read()
strip_tags(text)

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "G:/Dropbox/Textual/codes/Python/Parsing/Word_Count.py", line 26, in strip_tags
    s.feed(html)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 169, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 245, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Martineau\Anaconda\lib\markupbase.py", line 160, in parse_marked_section
    self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 124, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: unknown status keyword 't\n' in marked section, at line 35210, column 58

What does this error mean? How can I bypass this error?

The actual file that I want to parse is this one

Plug4
  • I'd assume it hit some invalid markup. You could either try and catch the error or feed it through beautifulsoup beforehand. – Peter Nov 17 '14 at 01:42

1 Answer


The problem is very simple, but messy. You are not parsing HTML. You are parsing HTML wrapped in what appears to be the SEC's homegrown SGML vocabulary. Confused? Not surprised. Here's what visiting your data link, saving the file, and opening it up looks like:

    <SEC-DOCUMENT>0001005214-12-000007.txt : 20120430
    <SEC-HEADER>0001005214-12-000007.hdr.sgml : 20120430
    <ACCEPTANCE-DATETIME>20120430163103
    ACCESSION NUMBER:       0001005214-12-000007
    CONFORMED SUBMISSION TYPE:  10-K
    PUBLIC DOCUMENT COUNT:      12
    CONFORMED PERIOD OF REPORT: 20120131
    FILED AS OF DATE:       20120430
    DATE AS OF CHANGE:      20120430

    FILER:

        COMPANY DATA:   
            COMPANY CONFORMED NAME:         AMERICAN WAGERING INC
            CENTRAL INDEX KEY:          0001005214
            STANDARD INDUSTRIAL CLASSIFICATION: SERVICES-MISCELLANEOUS AMUSEMENT & RECREATION [7990]
            IRS NUMBER:             880344658
            STATE OF INCORPORATION:         NV
            FISCAL YEAR END:            0105

        FILING VALUES:
            FORM TYPE:      10-K
            SEC ACT:        1934 Act
            SEC FILE NUMBER:    000-20685
            FILM NUMBER:        12795496

        BUSINESS ADDRESS:   
            STREET 1:       675 GRIER DR
            CITY:           LAS VEGAS
            STATE:          NV
            ZIP:            89119
            BUSINESS PHONE:     7027350101

        MAIL ADDRESS:   
            STREET 1:       675 GRIER DR
            CITY:           LAS VEGAS
            STATE:          NV
            ZIP:            89119
    </SEC-HEADER>
    <DOCUMENT>
    <TYPE>10-K
    <SEQUENCE>1
    <FILENAME>formtenk-01312012.htm
    <DESCRIPTION>FORM 10 K 1.31.2012
    <TEXT>
    <html>
    <head>
        <title>formtenk-01312012.htm</title>
        <!--Licensed to: American Wagering, Inc.-->
        <!--Document Created using EDGARizer 2020 5.4.1.0-->
        <!--Copyright 1995 - 2009 Thomson Reuters. All rights reserved.-->
    </head>
    <body bgcolor="#ffffff" style="DISPLAY: inline; FONT-FAMILY: Palatino Linotype; FONT-SIZE: 9pt">
    <div>

Then skipping oodles of HTML lines, we pick it back up at:

    </div>
    </body>
    </html>
    </TEXT>
    </DOCUMENT>
    <DOCUMENT>
    <TYPE>ZIP
    <SEQUENCE>33
    <FILENAME>0001005214-12-000007-xbrl.zip
    <DESCRIPTION>IDEA: XBRL DOCUMENT
    <TEXT>
    begin 644 0001005214-12-000007-xbrl.zip
    M4$L#!!0````(`/"#GD":H45DWI(``/X8"``1`!P`8F5T;2TR,#$R,#$S,2YX
    M;6Q55`D``Z/VGD^C]IY/=7@+``$$)0X```0Y`0``[#UI;QLYEM\7V/_`T223
    M!)!DE20?<HZ!XZ1[W)T+<;I[@<5B0%51$MMU+<FRK/WU^]XCZY!<\I&V$RDN
    M8`]9Q>/=%TM\+_YY$87L7"@MD_AER^OV6DS$?A+(>/JRE>D.U[Z4K7^^^L__
    M>/&W3N=G$0O%C0C8>,&^S))()S'[+#(#"[`CWQ<A3.G@X(NQ"AFL'>M#_"A?

So now we're out of HTML and into a uuencoded zip file of XBRL data. Skipping a gazillion of those lines, the file ends with:

    MN?<,9P8'``"4-```$0`8```````!````I($][P``8F5T;2TR,#$R,#$S,2YX
    M<V155`4``Z/VGD]U>`L``00E#@``!#D!``!02P4&``````8`!@`:`@``CO8`
    #````
    `
    end

    </TEXT>
    </DOCUMENT>
    <DOCUMENT>
    <TYPE>XML
    <SEQUENCE>34
    <FILENAME>FilingSummary.xml
    <DESCRIPTION>IDEA: XBRL DOCUMENT
    <TEXT>
    <XBRL>
    <?xml version="1.0" encoding="utf-8"?>
    <FilingSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Version>2.4.0.6</Version>
      <ProcessingTime />
      <ReportFormat>Html</ReportFormat>
      <ContextCount>27</ContextCount>
      <ElementCount>111</ElementCount>
      <EntityCount>1</EntityCount>
      <FootnotesReported>false</FootnotesReported>
      <SegmentCount>5</SegmentCount>
      <ScenarioCount>0</ScenarioCount>
      <TuplesReported>false</TuplesReported>
      <UnitCount>4</UnitCount>
      <MyReports>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R1.htm</HtmlFileName>
          <LongName>000100 - Document - Document and Entity Information</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/DocumentAndEntityInformation</Role>
          <ShortName>Document and Entity Information</ShortName>
        </Report>
        <Report>
          <IsDefault>true</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R2.htm</HtmlFileName>
          <LongName>010000 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedBalanceSheets</Role>
          <ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R3.htm</HtmlFileName>
          <LongName>010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedBalanceSheetsParenthetical</Role>
          <ShortName>CONSOLIDATED BALANCE SHEETS (Parenthetical)</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R4.htm</HtmlFileName>
          <LongName>020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfOperations</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF OPERATIONS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R5.htm</HtmlFileName>
          <LongName>030000 - Statement - CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfStockholdersEquityDeficiency</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R6.htm</HtmlFileName>
          <LongName>040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfCashFlows</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF CASH FLOWS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R7.htm</HtmlFileName>
          <LongName>060100 - Disclosure - Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/OrganizationRisksAndUncertaintiesAndSummaryOfSignificantAccountingPolicies</Role>
          <ShortName>Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R8.htm</HtmlFileName>
          <LongName>060200 - Disclosure - Property and Equipment</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/PropertyAndEquipment</Role>
          <ShortName>Property and Equipment</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R9.htm</HtmlFileName>
          <LongName>060300 - Disclosure - Debt</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/Debt</Role>
          <ShortName>Debt</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R10.htm</HtmlFileName>
          <LongName>060400 - Disclosure - Series A Preferred Stock</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/SeriesPreferredStock</Role>
          <ShortName>Series A Preferred Stock</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R11.htm</HtmlFileName>
          <LongName>060500 - Disclosure - Stock Options and Other Equity and Related Party Transactions</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/StockOptionsAndOtherEquityAndRelatedPartyTransactions</Role>
          <ShortName>Stock Options and Other Equity and Related Party Transactions</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R12.htm</HtmlFileName>
          <LongName>060600 - Disclosure - Commitments and Contingencies</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/CommitmentsAndContingencies</Role>
          <ShortName>Commitments and Contingencies</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R13.htm</HtmlFileName>
          <LongName>060700 - Disclosure - Related Party Transactions</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/RelatedPartyTransactions</Role>
          <ShortName>Related Party Transactions</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R14.htm</HtmlFileName>
          <LongName>060800 - Disclosure - Income Taxes</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/IncomeTaxes</Role>
          <ShortName>Income Taxes</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R15.htm</HtmlFileName>
          <LongName>060900 - Disclosure - Business Segments</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/BusinessSegments</Role>
          <ShortName>Business Segments</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R16.htm</HtmlFileName>
          <LongName>061000 - Disclosure - Additional Supplementary Cash Flow Information</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/AdditionalSupplementaryCashFlowInformation</Role>
          <ShortName>Additional Supplementary Cash Flow Information</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R17.htm</HtmlFileName>
          <LongName>061100 - Disclosure - Financial Instruments</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/FinancialInstruments</Role>
          <ShortName>Financial Instruments</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <LongName>All Reports</LongName>
          <ReportType>Book</ReportType>
          <ShortName>All Reports</ShortName>
        </Report>
      </MyReports>
      <Logs>
        <Log type="Info">Process Flow-Through: 010000 - Statement - CONSOLIDATED BALANCE SHEETS</Log>
        <Log type="Info">   Process Flow-Through: Removing column 'Jan. 31, 2010'</Log>
        <Log type="Info">Process Flow-Through: 010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</Log>
        <Log type="Info">Process Flow-Through: 020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</Log>
        <Log type="Info">Process Flow-Through: 040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</Log>
      </Logs>
      <InputFiles>
        <File>betm-20120131.xml</File>
        <File>betm-20120131.xsd</File>
        <File>betm-20120131_cal.xml</File>
        <File>betm-20120131_def.xml</File>
        <File>betm-20120131_lab.xml</File>
        <File>betm-20120131_pre.xml</File>
      </InputFiles>
      <SupplementalFiles />
      <BaseTaxonomies />
      <HasPresentationLinkbase>true</HasPresentationLinkbase>
      <HasCalculationLinkbase>true</HasCalculationLinkbase>
    </FilingSummary>
    </XBRL>
    </TEXT>
    </DOCUMENT>
    </SEC-DOCUMENT>

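So all in all, you have a multipart document encoded in a text format with a header, a text section, an HTML section, an XBRL file, and a report. HTMLParser only understands the HTML part. The error most likely comes from further down: a chance "<![" sequence in the encoded data looks to the parser like the start of an SGML marked section (such as "<![CDATA["), and whatever follows it is not a keyword it recognizes. If you want to use the simple HTMLParser to read the file, you're going to have to extract the HTML section first.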
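To see which parts a given filing actually contains before deciding what to extract, a rough inventory can help. The following is only a sketch: it assumes every part is wrapped in <DOCUMENT>...</DOCUMENT> and carries <TYPE>, <SEQUENCE>, and <FILENAME> lines before <TEXT>, as in the excerpts above. The function name list_documents and the trimmed sample string are made up for illustration.

```python
import re

# Assumption: each part sits between <DOCUMENT> and </DOCUMENT>.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
# Assumption: these header fields each occupy one line, with no closing tag.
FIELD_RE = re.compile(r"^<(TYPE|SEQUENCE|FILENAME)>(.*)$", re.MULTILINE)


def list_documents(raw):
    """Return one dict per <DOCUMENT> block with its TYPE, SEQUENCE, FILENAME."""
    docs = []
    for match in DOC_RE.finditer(raw):
        # Only inspect the header lines that come before the payload.
        header = match.group(1).split("<TEXT>", 1)[0]
        docs.append({key: value.strip() for key, value in FIELD_RE.findall(header)})
    return docs


# Trimmed, hand-made sample in the same shape as the real file.
sample = """<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>formtenk-01312012.htm
<TEXT>
<html><body>report</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>ZIP
<SEQUENCE>33
<FILENAME>0001005214-12-000007-xbrl.zip
<TEXT>
begin 644 0001005214-12-000007-xbrl.zip
end
</TEXT>
</DOCUMENT>
"""

for doc in list_documents(sample):
    print(doc["TYPE"], doc["FILENAME"])
```

On the real file you would pass in the result of reading 0001005214-12-000007.txt. The 10-K entry is the HTML document, which is the part worth feeding to HTMLParser.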

So, how to do that? Try a preprocess step like this:

import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
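            if line == stop:
                # A bare return ends the generator cleanly. (raise StopIteration
                # inside a generator is a RuntimeError on Python 3.7+, PEP 479.)
                return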


origpath = '0001005214-12-000007.txt'
htmlpath = origpath.replace('.txt', '.html')

with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))

Once you've stripped out just the HTML lines, you can use your original code to parse the file in htmlpath, which is truly the HTML part.
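If you are on Python 3 and would rather skip the intermediate file, the same idea can be done in memory. This is a sketch under a few assumptions: the module is html.parser rather than HTMLParser, the HTML part is delimited by <html> ... </html> in any letter case, and only the first such section is wanted. The names HTML_RE and strip_first_html_part are invented here; MLStripper follows the class from the question.

```python
import re
from html.parser import HTMLParser  # Python 3 home of the Python 2 HTMLParser module


class MLStripper(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


# Assumption: the section of interest runs from the first <html ...> to the
# nearest following </html>.
HTML_RE = re.compile(r"<html\b.*?</html>", re.IGNORECASE | re.DOTALL)


def strip_first_html_part(raw):
    """Return the tag-free text of the first HTML section, or '' if none is found."""
    match = HTML_RE.search(raw)
    if match is None:
        return ''
    s = MLStripper()
    s.feed(match.group(0))
    s.close()  # flush any text the parser is still buffering
    return s.get_data()


# Hand-made miniature of the multipart layout, including a stray "<![" line.
raw = (
    "<SEC-HEADER>header text\n<TEXT>\n"
    "<html>\n<body><p>Hello <b>world</b></p></body>\n</html>\n"
    "</TEXT>\nbegin 644 x.zip\nM<![T\nend\n"
)
print(strip_first_html_part(raw))
```

Reading the whole filing into one string is simpler than the line-by-line generator but holds the entire file in memory, so the generator is still the better fit for very large filings.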

Jonathan Eunice
  • Wow! Fantastic answer. I now understand the issue. Thanks for the help! – Plug4 Nov 17 '14 at 03:40
  • `raise StopIteration` now breaks the loop after some updates ([source](https://stackoverflow.com/a/51701040)). Changing to `continue` fixed the problem for me. – rmd0001 Jun 16 '23 at 10:36