Encoding error on Python

Question

I am running the following code with Python 3:

from csv import reader, writer


def my_function(file1, file2, output, xs, stringL = 'k', delim = ','):

    with open(file1, 'r') as text, open(file2, 'r') as src, open(output, 'w') as dst:
        for l in text:
            for x in xs:
                if stringL in l:
                    print("found!")

        my_reader = reader(src, delimiter = delim)
        my_writer = writer(dst, delimiter = delim)

        columnNumber = 0
        for column in zip(*my_reader):
            print(column, columnNumber)
            columnNumber += 1


if __name__ == '__main__':
    from sys import argv
    if len(argv) == 5:
        my_function(argv[1], argv[2], argv[3], argv[4])
    elif len(argv) == 6:
        my_function(argv[1], argv[2], argv[3], argv[4], argv[5])
    elif len(argv) == 7:
        my_function(argv[1], argv[2], argv[3], argv[4], argv[5], argv[6])
    else:
        print("Invalid number of arguments")
    print("Done")

file1 is a text file like:

a
k
k
a
k
k
a
a
a
z

a
a
a

file2 is any csv file

I encounter the error:

  File "error.py", line 16, in my_function
  for column in zip(*my_reader):
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 12: invalid continuation byte

I found the same error here, along with a solution to it. However, I have trouble adapting that solution to my code. I tried several things, such as

column = unicode(column, errors = 'replace')

but it still doesn't work.

Could you please help me?

You can call a function with a list as separate arguments with the `*expression` syntax: `my_function(*argv[1:])`. That saves you a lot of code in the `__main__` block there. — Martijn Pieters, Jun 30 '13 at 21:45

Martijn Pieters · Accepted Answer · 2013-06-30T21:53:41.723


Python 3 opens files in text mode by default and decodes their contents to Unicode using the platform's default encoding, which is UTF-8 on your system. Your input file is not UTF-8, however, so decoding fails.

It is impossible to deduce from the error message or your post what the correct encoding is, but you need to find out and specify it when opening the file:

open(file2, 'r', encoding='*correct encoding for file2*', newline='') as src

Note the newline='' as well; see the csv.reader() documentation.
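As a minimal sketch of the advice above: the helper name read_columns and the sample data are made up for illustration, and Latin-1 is only an assumed encoding for the demo file, not a claim about the asker's file. The same bytes fail to decode as UTF-8 but read cleanly once the matching encoding is passed to open().

```python
import csv
import os
import tempfile


def read_columns(path, encoding, delimiter=','):
    """Return the CSV columns of `path` as a list of tuples (hypothetical helper)."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, 'r', encoding=encoding, newline='') as src:
        return list(zip(*csv.reader(src, delimiter=delimiter)))


# Build a small sample file encoded as Latin-1; 'Þ' becomes the single byte 0xDE.
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write('name,value\r\nÞor,1\r\n'.encode('latin-1'))
    sample_path = tmp.name

try:
    read_columns(sample_path, 'utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    # 0xDE starts a two-byte UTF-8 sequence, but 'o' is not a continuation byte.
    utf8_failed = True

# With the encoding the file was actually written in, reading succeeds.
columns = read_columns(sample_path, 'latin-1')
os.remove(sample_path)
```

Note that Latin-1 assigns a character to every possible byte value, so decoding with it never raises; a clean read is therefore not proof that Latin-1 is the right encoding.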

Your sys.argv handling is overly verbose; just use:

if __name__ == '__main__':
    from sys import argv
    if 5 <= len(argv) <= 7:
        my_function(*argv[1:])
    else:
        print("Invalid number of arguments")
    print("Done")
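To see why the single call is enough, here is a small sketch with a stand-in function that has the same signature as the one in the question; the body and the argument values are invented for the demonstration. Unpacking a shorter list simply leaves the trailing parameters at their defaults.

```python
def my_function(file1, file2, output, xs, stringL='k', delim=','):
    # Stand-in body (an assumption): just report which values were received.
    return (file1, file2, output, xs, stringL, delim)


# Hypothetical command lines; argv[0] is the script name, as in sys.argv.
argv_short = ['error.py', 'a.txt', 'b.csv', 'out.csv', 'x']
argv_full = argv_short + ['z', ';']

# Four values supplied: stringL and delim fall back to their defaults.
result_short = my_function(*argv_short[1:])

# Six values supplied: every parameter is set from the command line.
result_full = my_function(*argv_full[1:])
```

All values arriving through sys.argv are strings, so any numeric parameter would still need an explicit conversion.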

edited Jun 30 '13 at 21:53

answered Jun 30 '13 at 21:48

Martijn Pieters


Hi Martijn, thanks for the answer. I tried several encodings (ascii, utf-16, latin-1). For ascii and utf-16, I got the same error. But for latin-1 (iso-8859-1), I get the following: `for column in zip(*my_reader): _csv.Error: line contains NULL byte`. Does that mean that this is the correct encoding but I have another problem? – bigTree Jun 30 '13 at 22:20
@bigTree: Sounds like you got closer. NULL bytes are used in UTF-16 and UTF-32, but it sounds like it might be a different encoding still. For what it is worth, the `\xde` byte is a `Þ` character (LATIN CAPITAL LETTER THORN) in Latin 1. Probably not what you were expecting to find. – Martijn Pieters Jun 30 '13 at 22:21
What is \xde? I think the encoding is Latin 1 (I tried another piece of code on my file and it worked: `str_decode = lambda x: x.decode('ISO-8859-1') def columns_average(fr,deli=";"): average = list() func = lambda x: str_decode(x).isnumeric() for column in izip(*reader(fr, delimiter=deli)): strength = [float(x) for x in ifilter(func, column)] s, n = fsum(strength), len(strength) average.append( s/n if n!=0 else 0.0 ) fr.seek(0) return map(str,average)`). But I didn't get this error (about the NULL byte) there. – bigTree Jun 30 '13 at 22:34
Sorry, without seeing the files it is impossible to say. – Martijn Pieters Jun 30 '13 at 22:41
You do not have a CSV file. You have an OpenDocument file, which is a binary format (zipped XML). – Martijn Pieters Jun 30 '13 at 22:47
Export the data to CSV from LibreOffice or OpenOffice. – Martijn Pieters Jun 30 '13 at 22:48
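
A file extension says nothing about the actual content, so a quick check can catch this kind of mix-up before the csv module is involved. The sketch below is a heuristic, not a full validator, and looks_like_opendocument is a made-up name: it relies on OpenDocument packages being ZIP archives that carry a `mimetype` entry, and uses only the standard zipfile module.

```python
import os
import tempfile
import zipfile


def looks_like_opendocument(path):
    """Heuristic (hypothetical helper): True if `path` is a ZIP archive
    whose 'mimetype' entry names an OpenDocument type."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        if 'mimetype' not in archive.namelist():
            return False
        mimetype = archive.read('mimetype').decode('ascii', errors='replace')
    return mimetype.startswith('application/vnd.oasis.opendocument.')


with tempfile.TemporaryDirectory() as tmpdir:
    # A stand-in spreadsheet package saved under a misleading .csv name.
    fake_csv = os.path.join(tmpdir, 'data.csv')
    with zipfile.ZipFile(fake_csv, 'w') as archive:
        archive.writestr('mimetype',
                         'application/vnd.oasis.opendocument.spreadsheet')

    # A genuine plain-text CSV file for comparison.
    real_csv = os.path.join(tmpdir, 'plain.csv')
    with open(real_csv, 'w', encoding='utf-8', newline='') as f:
        f.write('a,b\r\n1,2\r\n')

    ods_detected = looks_like_opendocument(fake_csv)
    csv_detected = looks_like_opendocument(real_csv)
```

If the check returns True, the file needs to be exported to CSV from the office suite rather than read with a different encoding.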
