I am writing a program that reads a webpage with urllib, then uses "html2text" to write the plain text to a file. However, the raw content returned by urllib's read() contains various non-ASCII characters, so it keeps raising UnicodeDecodeError.

I of course Googled this for 3 hours and got plenty of answers, such as using HTMLParser, reload(sys), external modules like pdfkit or BeautifulSoup, and of course .encode/.decode.

Reloading sys and then calling sys.setdefaultencoding("utf-8") gives me the desired results, but IDLE and the program become unresponsive after that.

I tried every variation of .encode/.decode with 'utf-8' and 'ascii', with arguments like 'replace', 'ignore', etc. For some reason, it raises the same error every time, regardless of the arguments I supply to encode/decode.

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    page = urllib.urlopen(url)
    content = page.read()
    with open(name, 'wb') as w:
        HP_inst = HTMLParser.HTMLParser()
        content = content.encode('ascii', 'xmlcharrefreplace')
        if True: 
            #w.write(HTT.html2text( (HP_inst.unescape( content ) ).encode('utf-8') ) )
            w.write( HTT.html2text( content) )#.decode('ascii', 'ignore')  ))
            w.close()
            print "Saved!"

There has to be another method or encoding I am missing... Please help!

Side Quest: I sometimes have to write it to a file where the name includes unsupported chars like "G\u00e9za Teleki"+".txt". How do I filter those characters out?

Note:

  • This function was stored inside a class (hint "self").
  • Using Python 2.7
  • Don't want to use BeautifulSoup
  • Windows 8 64-bit
Chris Nguyen

2 Answers

You should decode the content you get from urllib with the proper encoding, e.g. utf-8 or latin1, depending on the page you fetch.

There are various ways to detect the encoding of the content: from the HTTP headers or from the meta tag in the HTML. I like to use an encoding-detection module whose name I forget; you could google it.

Once you have decoded it properly, you can encode it to any encoding you like before writing it to a file.
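
As a minimal sketch of that decode-then-encode order, assuming the page happens to be UTF-8 (the sample bytes below are made up for illustration, using the name from the question): in Python 2, calling .encode on a byte string first decodes it implicitly as ASCII in strict mode, which is why the same UnicodeDecodeError appears no matter which error argument is passed. Decoding explicitly first avoids that.

```python
# UTF-8 bytes, standing in for what page.read() returns (hypothetical sample)
raw = b"G\xc3\xa9za Teleki"

# bytes -> unicode; this must use the encoding the page was really sent in
text = raw.decode("utf-8")

# unicode -> bytes, in whatever encoding the output file should have
as_utf8 = text.encode("utf-8")
# For ASCII output, non-ASCII characters become numeric character references
as_ascii = text.encode("ascii", "xmlcharrefreplace")
```

The error arguments ('replace', 'ignore', 'xmlcharrefreplace') only take effect on the explicit call, so they work here because the input to encode is already unicode.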

======================================

Here's an example using chardet:

import urllib
import chardet


def main():
    page = urllib.urlopen('http://bbc.com')
    content = page.read()  # raw bytes, not yet decoded

    # Detect the encoding; chardet reports None when it cannot guess,
    # so fall back to UTF-8 in that case
    encoding = chardet.detect(content)['encoding'] or 'utf-8'

    # Decode the bytes into unicode, replacing any undecodable bytes
    content = content.decode(encoding, 'replace')

    # Write to file as UTF-8
    with open('test.txt', 'wb') as f:
        f.write(content.encode('utf-8'))


if __name__ == '__main__':
    main()
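
The header/meta route mentioned above can also be sketched with only the standard library. This is a rough illustration, not a full HTML-aware detector: the function name guess_encoding, its parameters, and the 2048-byte search window are my own choices, and it only looks for a charset= token.

```python
import re


def guess_encoding(content_type, body, default="utf-8"):
    """Return a charset name from the Content-Type header, else from a
    <meta> tag near the top of the page, else the default."""
    # 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type or "", re.I)
    if match:
        return match.group(1).lower()

    # 2. <meta charset="..."> or <meta ... content="text/html; charset=...">
    match = re.search(br"<meta[^>]+charset=[\"']?([\w.:-]+)", body[:2048], re.I)
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. Nothing declared: fall back to the caller's default
    return default
```

With the Python 2 urllib used in the question, the header string would come from page.info().getheader('Content-Type') and the body from page.read(). A declared charset can still be wrong, so decoding with 'replace' remains a sensible safety net.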
zephor
  • Can you give an example? – Chris Nguyen Nov 28 '15 at 02:01
  • @ChrisNguyen It wasn't convenient for me earlier; I've added an example now. – zephor Nov 28 '15 at 02:32
  • Ohh okay, I see how encoding works... You have to decode it with its original encoding? And is using an external library the only way to detect the encoding, or is there a way without external modules? – Chris Nguyen Nov 28 '15 at 04:50
  • And how do I use chardet? I downloaded chardet.tar.gz and ran "python setup.py install", but I don't have setuptools on here... Any way to get around this? – Chris Nguyen Nov 28 '15 at 04:58
  • Follow [this](http://stackoverflow.com/questions/1449396/how-to-install-setuptools); setuptools is a basic component for installing third-party modules – zephor Nov 28 '15 at 05:04

You have to know the encoding the remote web page is using. There are numerous ways to find it, but the easiest is to use the Python Requests library instead of urllib. The .text attribute of a Requests response returns a pre-decoded Unicode object.

You can then use an encoding file wrapper to automatically encode every character you write.

import io

import requests
import html2text as HTT  # assumption: HTT is the question's alias for html2text


def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    req = requests.get(url)
    content = req.text  # Unicode object, decoded using the charset from the server's headers
    # Everything written to w is encoded to UTF-8 by io.open
    with io.open(name, 'w', encoding="utf-8") as w:
        w.write(HTT.html2text(content))

    print "Saved"
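
To see what the encoding file wrapper does on its own, here is a small sketch that writes a unicode string through io.open and then reads the raw bytes back. The scratch file name demo.txt is made up for the demonstration, and the sample text is the name from the question.

```python
import io
import os
import tempfile

text = u"G\u00e9za Teleki"  # unicode string with one non-ASCII character

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode with an explicit encoding: unicode goes in, UTF-8 bytes land on disk
with io.open(path, "w", encoding="utf-8") as w:
    w.write(text)

# Binary mode shows the bytes that were actually written
with io.open(path, "rb") as r:
    raw = r.read()
```

Because the wrapper does the encoding, the caller never calls .encode itself; it only has to make sure that what it writes is already unicode.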
Alastair McCormack
  • Is requests an external module? If so, how do I get it? And is there anything in the default Python library that can do this? – Chris Nguyen Dec 08 '15 at 11:09