Python - UnicodeDecodeError when trying to parse HTML file which is ASCII

Question

I am using Python 2.7.6.

I have HTML files containing placeholders prefixed with "$". I wrote a program which takes in JSON data and replaces each "$"-prefixed placeholder with the corresponding JSON value.

This was working fine until someone opened the set of HTML files with a different editor and changed them from UTF-8 to ASCII.

class FileUtil:
    @staticmethod
    def replace_all(output_file, data):
        homedir = os.path.expanduser("~")
        dest_dir = homedir + "/dest_dir"
        with open(output_file, "r") as my_file:
            contents = my_file.read()
        destination_file = dest_dir + "/" + data["filename"]
        fp = open(destination_file, "w")
        for key, value in data.iteritems():
            contents = contents.replace("$" + str(key), value)
        fp.write(contents)
        fp.close()

Whenever my program encounters one of the files that was changed to ASCII, it raises this error:

Traceback (most recent call last):
    File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 239, in process
        return self.handle()
    File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 230, in handle
        return self._delegate(fn, self.fvars, args)
    File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 420, in _delegate
        return handle_class(cls)
    File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 396, in handle_class
        return tocall(*args)
        FileUtil.replace_all(output_file, data)
    File "/home/devuser/demo/utils/fileutils.py", line 11, in replace_all
        contents = contents.replace("$" + str(key), value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 54826: ordinal not in range(128)
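
For context, here is a minimal sketch of what the last frame is most likely doing. It assumes `contents` is a byte string read from the file and `value` is a `unicode` object coming from the JSON data, which the question does not confirm. In that case Python 2 decodes `contents` with the default `ascii` codec before replacing, and that implicit decode is what fails. The explicit decode below raises the same kind of error on both Python 2 and Python 3. The sample bytes and the helper name `decode_error_message` are made up for illustration.

```python
# Hypothetical sample: UTF-8 bytes of a right single quote, which start with 0xe2.
raw = b"It\xe2\x80\x99s $name"


def decode_error_message(data, encoding):
    """Return the decode error message for `data`, or None if it decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding as ASCII fails on the first byte above 127.
ascii_error = decode_error_message(raw, "ascii")
# Decoding as UTF-8 succeeds, so the bytes are valid UTF-8 rather than ASCII.
utf8_error = decode_error_message(raw, "utf-8")

print(ascii_error)
print(utf8_error)
```

If the ASCII decode fails while the UTF-8 decode succeeds, the file is UTF-8 with non-ASCII characters in it, not ASCII.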

Question(s):

Is there a way to force `contents` to be strictly UTF-8 in Python?
Is it better to use a command-line utility in Ubuntu Linux to convert the file before running this Python script?
Is the error an encoding problem (e.g. the file is ASCII and not UTF-8)?

Use `codecs` to open the file and handle encoding. Fail on Unicode errors if you only want UTF-8. — Bob Dylan, Jan 12 '16 at 20:12
You could use `iconv` to convert the file, but if the file doesn't adhere to an encoding, what would you use to convert it from? For one, the file isn't ASCII, because ASCII doesn't use byte values over 127, which is exactly what the error tells you, btw. In any case, I'd consider upgrading Python to version 3, because much of the handling of encodings has improved. Lastly, search for the error message online to find out what it means; there are hundreds of similar questions here and yours doesn't add much. — Ulrich Eckhardt, Jan 12 '16 at 20:30
The problem seems to be the opposite. Your program is using the default 'ascii' codec, and the file is not ASCII. If the file was opened in Windows, it's likely it was changed to 'latin-1'. Use ``codecs.open()`` to open the file with the desired encoding. You can use packages like [chardet](https://pypi.python.org/pypi/chardet) to detect the encoding of the file. — Apalala, Jan 12 '16 at 20:32
Possible duplicate of [Python: write Unicode text to a text file?](http://stackoverflow.com/questions/6048085/python-write-unicode-text-to-a-text-file) — roeland, Jan 13 '16 at 00:28
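
The "open with an explicit encoding and fail on errors" advice from the comments could look roughly like the sketch below. It uses `io.open` rather than `codecs.open`, a swapped-in choice: `io.open` also takes an `encoding` argument, exists on Python 2.6+, and is the built-in `open` on Python 3. The function name `read_utf8_strict` and the demo file are hypothetical.

```python
import io
import os
import tempfile


def read_utf8_strict(path):
    """Read `path` as text, raising UnicodeDecodeError on any invalid UTF-8 byte."""
    with io.open(path, "r", encoding="utf-8", errors="strict") as handle:
        return handle.read()


# Demo with a throwaway file that contains one non-ASCII character.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.html")
with io.open(demo_path, "wb") as handle:
    handle.write(u"<p>caf\u00e9 $price</p>".encode("utf-8"))

text = read_utf8_strict(demo_path)
print(repr(text))
```

A file saved as Latin-1 would normally make this function raise instead of returning mangled text, which is the point of `errors="strict"`. A detection tool such as chardet can then help work out what the file actually is.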

score 0 · Answer 1 · answered Jan 13 '16 at 00:42

@Apalala

Thank you very much regarding chardet! It was a very useful tool.

@Ulrich Eckhardt

You are right, it is UTF-8 and not ASCII.

This was the solution:

iconv --from-code UTF-8 --to-code US-ASCII -c hello.htm > hello.html
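
Note that `-c` makes `iconv` drop every character it cannot convert, so the resulting file loses all of its non-ASCII characters. If those characters should be kept, an alternative is to stay in Python and work on text instead of bytes: decode the file as UTF-8, do the replacements, and write the result back as UTF-8. The following is only a sketch under that assumption. The function name `replace_placeholders`, the file names, and the sample data are hypothetical and not part of the original program.

```python
import io
import os
import tempfile


def replace_placeholders(source_path, dest_path, data):
    """Copy source_path to dest_path, replacing each "$key" with its value."""
    with io.open(source_path, "r", encoding="utf-8") as src:
        contents = src.read()  # decoded text, not raw bytes
    for key, value in data.items():
        # Format both sides as text so no implicit ASCII decode can occur.
        contents = contents.replace(u"$" + u"{0}".format(key), u"{0}".format(value))
    with io.open(dest_path, "w", encoding="utf-8") as dest:
        dest.write(contents)


# Demo with a throwaway template that contains a non-ASCII apostrophe.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "template.html")
dest = os.path.join(work_dir, "out.html")
with io.open(source, "w", encoding="utf-8") as handle:
    handle.write(u"<p>It\u2019s $price for $item</p>")

replace_placeholders(source, dest, {u"price": u"\u20ac5", u"item": u"tea"})

with io.open(dest, "r", encoding="utf-8") as handle:
    result = handle.read()
print(repr(result))
```

With this approach the apostrophe and the euro sign both survive in the output file, whereas the `iconv -c` conversion would remove them.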

PacificNW_Lover
