I am new to Python and am using Python 2.7.8 to parse SEC filings. The problem is in this part of my code:

    import urllib2

    response = urllib2.urlopen('https://www.sec.gov/Archives/edgar/data/1053507/0001193125-11-042904.txt')
    HTML = stack.strip_tags(response.read())

Note: strip_tags is built on HTMLParser, following the approach in the linked post.
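
The linked definition is not reproduced here. As a rough sketch only, an HTMLParser-based strip_tags usually looks something like the following. The class name `TagStripper` is hypothetical, and the import fallback is an assumption added so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TagStripper(HTMLParser):
    """Hypothetical helper: collects text content and ignores the tags."""

    def __init__(self):
        # Explicit base-class call, because HTMLParser is an old-style
        # class on Python 2 and super() would not work there.
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text that is still buffered
    return stripper.get_text()
```

A parser like this expects reasonably well-formed HTML, so markup it cannot interpret is where an error such as HTMLParseError would come from on Python 2.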

But I got this error "raise HTMLParseError(message, self.getpos()) HTMLParseError: expected name token at '

The same code works fine on other SEC filings. I googled the error and found a link that seems relevant, but even after replacing '!' with "" before invoking strip_tags(), I still get the HTMLParseError. Any ideas or suggestions would be very much appreciated.

  • What... what kind of crazy file format is this? It sure isn't HTML (even though a part of it is an awful mess of HTML) – Matti Virkkunen May 15 '15 at 23:59
  • Thanks. These are SEC EDGAR filings (which use the XBRL reporting language); some people call them SGML files? I am not quite sure. For my purpose, I just need to strip the tags for now. – bessiehere May 16 '15 at 00:17
  • This is a very old format that is "pseudo" SGML. There are starting tags, but no ending tags. The actual content is split into the "header" information and the documents on the site. The individual documents can be plain text, PDF, HTML, XML, or XBRL, depending on the form type. – Krazick Jul 14 '15 at 01:13
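
The structure described in the comments above suggests handling the file as a container rather than as one HTML page. The following is a minimal sketch, not a definitive parser. It assumes each document is wrapped in `<DOCUMENT>` ... `</DOCUMENT>` and starts with an unclosed `<TYPE>` line; those tag names are an assumption that does not come from this thread. The function names `split_documents` and `strip_tags_loosely` are hypothetical.

```python
import re

# Assumed layout: <DOCUMENT> ... </DOCUMENT> blocks, each with an unclosed
# header line such as "<TYPE>10-K". These tag names are an assumption.
DOCUMENT_RE = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r'<TYPE>([^\r\n<]*)', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]*>')


def split_documents(filing_text):
    """Return a list of (doc_type, body) pairs found in one filing."""
    documents = []
    for match in DOCUMENT_RE.finditer(filing_text):
        body = match.group(1)
        type_match = TYPE_RE.search(body)
        doc_type = type_match.group(1).strip() if type_match else ''
        documents.append((doc_type, body))
    return documents


def strip_tags_loosely(text):
    """Drop anything shaped like <...> without parsing it as HTML."""
    return TAG_RE.sub('', text)


if __name__ == '__main__':
    sample = (
        '<DOCUMENT>\n<TYPE>10-K\n<TEXT>\n'
        '<p>Annual <b>report</b></p>\n</TEXT>\n</DOCUMENT>\n'
    )
    for doc_type, body in split_documents(sample):
        print(doc_type)
        print(strip_tags_loosely(body).strip())
```

Because the regex never interprets the markup, it cannot raise a parse error, but it is crude: text that contains a literal `<` followed later by `>` (for example `a < b and c > d`) would lose everything in between. Filtering by document type first would make it possible to skip the PDF or XBRL parts before stripping tags.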
