So I am writing a program to read a webpage using urllib, then use "html2text" to write just the text to a file. However, the raw content returned by urllib's .read() contains various non-ASCII characters, so it continuously raises UnicodeDecodeError.
I of course Googled this for 3 hours and got plenty of answers: use HTMLParser, reload(sys), use external modules like pdfkit or BeautifulSoup, and of course .encode/.decode.
Reloading sys and then calling sys.setdefaultencoding("utf-8") gives me the desired results, but IDLE and the program become unresponsive after that.
I tried every variation of .encode/.decode with 'utf-8' and 'ascii', with error handlers like 'replace', 'ignore', etc. For some reason it raises the same error every time, regardless of the arguments I supply to encode/decode.
def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    page = urllib.urlopen(url)
    content = page.read()
    with open(name, 'wb') as w:
        HP_inst = HTMLParser.HTMLParser()
        content = content.encode('ascii', 'xmlcharrefreplace')
        if True:
            #w.write(HTT.html2text( (HP_inst.unescape( content ) ).encode('utf-8') ) )
            w.write( HTT.html2text( content ) )#.decode('ascii', 'ignore') ))
    w.close()
    print "Saved!"
There has to be another method or encoding I am missing... Please help!
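Is the following roughly the pattern the .encode/.decode answers are pointing at? As far as I can tell, the idea is to decode the downloaded bytes into unicode as early as possible (using the charset the server reports, falling back to UTF-8), keep everything unicode through html2text, and only encode again at the file boundary via io.open. This is just a sketch, not something I know works; HTT stands for the imported html2text module, as in my snippet above:

import io
import urllib
import html2text as HTT

def download(self, url, name="WebPage.txt"):
    ## Sketch: decode early, stay in unicode, encode only when writing
    page = urllib.urlopen(url)
    raw = page.read()                                  # byte string (str)
    charset = page.info().getparam('charset') or 'utf-8'
    html = raw.decode(charset, 'replace')              # now a unicode object
    text = HTT.html2text(html)                         # html2text works on unicode
    with io.open(name, 'w', encoding='utf-8') as w:    # io.open encodes for us
        w.write(text)
    print "Saved!"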
Side Quest: I sometimes have to write it to a file whose name includes unsupported characters, like "G\u00e9za Teleki"+".txt". How do I filter those characters out? See the sketch below for what I have in mind.
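Would something along these lines work for the filename part? The idea is to fold accented characters down to plain ASCII and then strip the characters Windows refuses in filenames. The helper name safe_filename and the example title are just placeholders:

import re
import unicodedata

def safe_filename(title):
    ## Fold accents to ASCII (u"G\u00e9za" -> "Geza"), drop anything left over
    ascii_name = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
    ## Strip characters Windows does not allow in filenames
    return re.sub(r'[<>:"/\\|?*]', '', ascii_name) + '.txt'

# e.g. safe_filename(u"G\u00e9za Teleki") -> "Geza Teleki.txt"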
Note:
- This function was stored inside a class (hence the "self").
- Using Python 2.7
- I don't want to use BeautifulSoup
- Windows 8 64-bit