The problem is very simple, but messy. You are not parsing HTML. You are parsing HTML wrapped in what appears to be the SEC's homegrown SGML vocabulary. Confused? Not surprised. Here's what visiting your data link, saving the file, and opening it up looks like:
<SEC-DOCUMENT>0001005214-12-000007.txt : 20120430
<SEC-HEADER>0001005214-12-000007.hdr.sgml : 20120430
<ACCEPTANCE-DATETIME>20120430163103
ACCESSION NUMBER: 0001005214-12-000007
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 12
CONFORMED PERIOD OF REPORT: 20120131
FILED AS OF DATE: 20120430
DATE AS OF CHANGE: 20120430
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: AMERICAN WAGERING INC
CENTRAL INDEX KEY: 0001005214
STANDARD INDUSTRIAL CLASSIFICATION: SERVICES-MISCELLANEOUS AMUSEMENT & RECREATION [7990]
IRS NUMBER: 880344658
STATE OF INCORPORATION: NV
FISCAL YEAR END: 0105
FILING VALUES:
FORM TYPE: 10-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 000-20685
FILM NUMBER: 12795496
BUSINESS ADDRESS:
STREET 1: 675 GRIER DR
CITY: LAS VEGAS
STATE: NV
ZIP: 89119
BUSINESS PHONE: 7027350101
MAIL ADDRESS:
STREET 1: 675 GRIER DR
CITY: LAS VEGAS
STATE: NV
ZIP: 89119
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>formtenk-01312012.htm
<DESCRIPTION>FORM 10 K 1.31.2012
<TEXT>
<html>
<head>
<title>formtenk-01312012.htm</title>
<!--Licensed to: American Wagering, Inc.-->
<!--Document Created using EDGARizer 2020 5.4.1.0-->
<!--Copyright 1995 - 2009 Thomson Reuters. All rights reserved.-->
</head>
<body bgcolor="#ffffff" style="DISPLAY: inline; FONT-FAMILY: Palatino Linotype; FONT-SIZE: 9pt">
<div>
Then skipping oodles of HTML lines, we pick it back up at:
</div>
</body>
</html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>ZIP
<SEQUENCE>33
<FILENAME>0001005214-12-000007-xbrl.zip
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 0001005214-12-000007-xbrl.zip
M4$L#!!0````(`/"#GD":H45DWI(``/X8"``1`!P`8F5T;2TR,#$R,#$S,2YX
M;6Q55`D``Z/VGD^C]IY/=7@+``$$)0X```0Y`0``[#UI;QLYEM\7V/_`T223
M!)!DE20?<HZ!XZ1[W)T+<;I[@<5B0%51$MMU+<FRK/WU^]XCZY!<\I&V$RDN
MH`]9Q>/=%TM\+_YY$87L7"@MD_AER^OV6DS$?A+(>/JRE>D.U[Z4K7^^^L__
M>/&W3N=G$0O%C0C8>,&^S))()S'[+#(#"[`CWQ<A3.G@X(NQ"AFL'>M#_"A?
So now we're out of the HTML and into a uuencoded ZIP archive holding the XBRL files. Skipping a gazillion of those lines, that part ends with:
MN?<,9P8'``"4-```$0`8```````!````I($][P``8F5T;2TR,#$R,#$S,2YX
M<V155`4``Z/VGD]U>`L``00E#@``!#D!``!02P4&``````8`!@`:`@``CO8`
#````
`
end
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>34
<FILENAME>FilingSummary.xml
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<XBRL>
<?xml version="1.0" encoding="utf-8"?>
<FilingSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Version>2.4.0.6</Version>
<ProcessingTime />
<ReportFormat>Html</ReportFormat>
<ContextCount>27</ContextCount>
<ElementCount>111</ElementCount>
<EntityCount>1</EntityCount>
<FootnotesReported>false</FootnotesReported>
<SegmentCount>5</SegmentCount>
<ScenarioCount>0</ScenarioCount>
<TuplesReported>false</TuplesReported>
<UnitCount>4</UnitCount>
<MyReports>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R1.htm</HtmlFileName>
<LongName>000100 - Document - Document and Entity Information</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/DocumentAndEntityInformation</Role>
<ShortName>Document and Entity Information</ShortName>
</Report>
<Report>
<IsDefault>true</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R2.htm</HtmlFileName>
<LongName>010000 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedBalanceSheets</Role>
<ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R3.htm</HtmlFileName>
<LongName>010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedBalanceSheetsParenthetical</Role>
<ShortName>CONSOLIDATED BALANCE SHEETS (Parenthetical)</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R4.htm</HtmlFileName>
<LongName>020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfOperations</Role>
<ShortName>CONSOLIDATED STATEMENTS OF OPERATIONS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R5.htm</HtmlFileName>
<LongName>030000 - Statement - CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfStockholdersEquityDeficiency</Role>
<ShortName>CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R6.htm</HtmlFileName>
<LongName>040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfCashFlows</Role>
<ShortName>CONSOLIDATED STATEMENTS OF CASH FLOWS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R7.htm</HtmlFileName>
<LongName>060100 - Disclosure - Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/OrganizationRisksAndUncertaintiesAndSummaryOfSignificantAccountingPolicies</Role>
<ShortName>Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R8.htm</HtmlFileName>
<LongName>060200 - Disclosure - Property and Equipment</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/PropertyAndEquipment</Role>
<ShortName>Property and Equipment</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R9.htm</HtmlFileName>
<LongName>060300 - Disclosure - Debt</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/Debt</Role>
<ShortName>Debt</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R10.htm</HtmlFileName>
<LongName>060400 - Disclosure - Series A Preferred Stock</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/SeriesPreferredStock</Role>
<ShortName>Series A Preferred Stock</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R11.htm</HtmlFileName>
<LongName>060500 - Disclosure - Stock Options and Other Equity and Related Party Transactions</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/StockOptionsAndOtherEquityAndRelatedPartyTransactions</Role>
<ShortName>Stock Options and Other Equity and Related Party Transactions</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R12.htm</HtmlFileName>
<LongName>060600 - Disclosure - Commitments and Contingencies</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/CommitmentsAndContingencies</Role>
<ShortName>Commitments and Contingencies</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R13.htm</HtmlFileName>
<LongName>060700 - Disclosure - Related Party Transactions</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/RelatedPartyTransactions</Role>
<ShortName>Related Party Transactions</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R14.htm</HtmlFileName>
<LongName>060800 - Disclosure - Income Taxes</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/IncomeTaxes</Role>
<ShortName>Income Taxes</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R15.htm</HtmlFileName>
<LongName>060900 - Disclosure - Business Segments</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/BusinessSegments</Role>
<ShortName>Business Segments</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R16.htm</HtmlFileName>
<LongName>061000 - Disclosure - Additional Supplementary Cash Flow Information</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/AdditionalSupplementaryCashFlowInformation</Role>
<ShortName>Additional Supplementary Cash Flow Information</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R17.htm</HtmlFileName>
<LongName>061100 - Disclosure - Financial Instruments</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/FinancialInstruments</Role>
<ShortName>Financial Instruments</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<LongName>All Reports</LongName>
<ReportType>Book</ReportType>
<ShortName>All Reports</ShortName>
</Report>
</MyReports>
<Logs>
<Log type="Info">Process Flow-Through: 010000 - Statement - CONSOLIDATED BALANCE SHEETS</Log>
<Log type="Info"> Process Flow-Through: Removing column 'Jan. 31, 2010'</Log>
<Log type="Info">Process Flow-Through: 010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</Log>
<Log type="Info">Process Flow-Through: 020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</Log>
<Log type="Info">Process Flow-Through: 040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</Log>
</Logs>
<InputFiles>
<File>betm-20120131.xml</File>
<File>betm-20120131.xsd</File>
<File>betm-20120131_cal.xml</File>
<File>betm-20120131_def.xml</File>
<File>betm-20120131_lab.xml</File>
<File>betm-20120131_pre.xml</File>
</InputFiles>
<SupplementalFiles />
<BaseTaxonomies />
<HasPresentationLinkbase>true</HasPresentationLinkbase>
<HasCalculationLinkbase>true</HasCalculationLinkbase>
</FilingSummary>
</XBRL>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
So all in all, you have a multipart document encoded in a text format: a header, the 10-K itself as HTML, a uuencoded ZIP of the XBRL files, and an XML filing summary. If you want to use the simple HTMLParser to read it, you're going to have to strip out the HTML section first.
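To see that structure for a particular filing, a rough sketch along these lines can inventory the parts. It assumes every part is wrapped in `<DOCUMENT>...</DOCUMENT>` with `<TYPE>`, `<SEQUENCE>`, `<FILENAME>` and `<DESCRIPTION>` lines ahead of `<TEXT>`, as in the dump above; `list_documents` is a made-up helper name, not part of any SEC tool or Python library.

```python
import re

# Hypothetical helper: the tag layout is inferred from the dump above,
# not taken from an official EDGAR specification.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TAG_RE = re.compile(r"^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$", re.MULTILINE)

def list_documents(submission_text):
    """Return one dict per <DOCUMENT> block: its header tags plus its body."""
    docs = []
    for match in DOC_RE.finditer(submission_text):
        block = match.group(1)
        # Everything before <TEXT> is the per-document header.
        head, _, rest = block.partition("<TEXT>")
        info = {name.lower(): value.strip() for name, value in TAG_RE.findall(head)}
        # Everything up to </TEXT> is the payload (HTML, uuencoded data, XML...).
        body, _, _ = rest.partition("</TEXT>")
        info["text"] = body.strip("\n")
        docs.append(info)
    return docs
```

Feeding it the full text of the saved file should give one entry per part, with `type` values such as `10-K`, `ZIP` and `XML`. That only inventories the parts; pulling the HTML out into its own file is still a separate step.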
So, how do you strip out the HTML section? Try a preprocessing step like this:
import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
            if line == stop:
                # a plain return ends the generator; raising StopIteration
                # here becomes a RuntimeError on Python 3.7+ (PEP 479)
                return

origpath = '0001005214-12-000007.txt'
htmlpath = origpath.replace('.txt', '.html')
with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))
Once you've stripped out just the HTML lines, you can use your original code to parse the file at htmlpath, which is truly the HTML part.
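If you ever want the XBRL data as well: the `begin 644 ... end` block in the dump above is uuencoded, and the standard library's `binascii.a2b_uu` decodes it one line at a time. Here is a minimal sketch, assuming the lines of that part are passed in as text; `uudecode_lines` is a hypothetical helper name, not an existing API.

```python
import binascii

def uudecode_lines(lines):
    """Decode the first uuencoded block ('begin ...' through 'end') found in lines."""
    out = bytearray()
    in_body = False
    for line in lines:
        line = line.rstrip("\r\n")
        if not in_body:
            # Skip everything until the 'begin <mode> <name>' line.
            if line.startswith("begin "):
                in_body = True
            continue
        if line == "end":
            break
        if line in ("", "`"):
            # Zero-length terminator line just before 'end'.
            continue
        out += binascii.a2b_uu(line)
    return bytes(out)
```

The returned bytes should be the ZIP archive itself, so they can be handed to `zipfile.ZipFile(io.BytesIO(data))` to get at the individual XBRL files.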