Terrible "invalid start byte" Unicode Error with Opening a CSV file

Question

Please please please help. I've been struggling with this for a while and have run into problem after problem. I'm just trying to write a loop that opens every CSV file in a folder. Here's the loop:

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'

for file in os.listdir (folder):
    with codecs.open(file, mode='rU', encoding='utf-8') as f:
        m=min(int(line[1]) for line in csv.reader(f))
        f.seek(0)
        for line in csv.reader(f):
            if int(line[1])==m:
                print line

Here's the error:

Traceback (most recent call last):
  File "findfirsttrigram.py", line 11, in <module>
    m=min(int(line[1]) for line in csv.reader(f))
  File "findfirsttrigram.py", line 11, in <genexpr>
    m=min(int(line[1]) for line in csv.reader(f))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 0: invalid start byte

I got here because first I had a "Null Byte" error, which I solved with this post: "Line contains NULL byte" in CSV reader (Python)

Then I got an integer error, which I solved with this post: "an integer is required" when open()'ing a file as utf-8?

Then I got an error that said 'UnicodeException: UTF-16 stream does not start with BOM', which I solved with this post: utf-16 file seeking in python. how?

Then I realized that the csv module requires utf-8 so here I am.

But I've finally hit the limit of the existing questions. I can't figure out what is going on. Please please help.

Have you considered using one of the encoding error handlers with the `errors` argument? See https://docs.python.org/2.7/library/codecs.html#codecs.replace_errors – wwii, Dec 09 '14 at 04:05
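
To make the suggestion above concrete, here is a minimal sketch of how the standard error handlers treat the byte from the traceback. The sample bytes are made up for illustration, and the snippet is written for Python 3; in Python 2.7 the same `errors` values can be passed to `str.decode` and `codecs.open`.

```python
# Hypothetical sample: the offending byte 0x87 followed by plain ASCII text.
raw = b"\x87abc,1900"

# 'strict' is the default handler and raises, as in the question's traceback.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = exc.reason

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_result)
print(repr(replaced))
print(repr(ignored))
```

Both lenient handlers lose information, so they are best treated as a way to get past stray bad bytes rather than as a fix for a file that is really in another encoding.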

score 1 · Accepted Answer · answered Dec 09 '14 at 05:53

I'm not sure why but this ultimately worked:

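import os
import unicodecsv  # third-party package

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams3/'

for file in os.listdir(folder):
    # os.listdir() returns bare names, so join each one with the folder path.
    with open(os.path.join(folder, file), mode='rU') as f:
        try:
            # Decode as UTF-8, replacing undecodable bytes instead of raising.
            m = min(int(line[1]) for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'))
        except Exception:
            # Skip files that cannot be parsed (empty file, short row, non-numeric column).
            print "one no work"
            continue
        f.seek(0)
        # Second pass: use the same decoding options as the first pass.
        for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'):
            if int(line[1]) == m:
                print line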

This works because of the combination of `encoding='utf-8'` and `errors='replace'`: the reader decodes the file as UTF-8 and substitutes a replacement character for any bytes that cannot be decoded. – shao.lo, May 27 '17 at 00:58
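
For readers on Python 3, a rough counterpart to the answer above can be sketched with the standard library alone. This is an assumption on my part, not what the answer used: it swaps unicodecsv for the built-in `csv` module, reads each file once instead of seeking back, and the helper name `rows_with_min_second_column` and the sample file are hypothetical.

```python
import csv
import io
import os
import tempfile


def rows_with_min_second_column(path):
    """Return the rows whose second column holds the smallest integer."""
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising.
    with io.open(path, mode="r", encoding="utf-8", errors="replace", newline="") as f:
        rows = [row for row in csv.reader(f) if len(row) > 1]
    smallest = min(int(row[1]) for row in rows)
    return [row for row in rows if int(row[1]) == smallest]


# Hypothetical sample data: the first row starts with the invalid byte 0x87.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "wb") as out:
    out.write(b"\x87 b c,1905\nd e f,1899\ng h i,1899\n")

result = rows_with_min_second_column(sample_path)
print(result)
```

Holding the rows in a list avoids the second pass over the file, at the cost of memory for very large files.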

score 0 · Answer 2 · answered Dec 09 '14 at 03:10

Perhaps try using os.walk and iterating with "for file in files"?

import codecs
import os

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for subdir, dirs, files in os.walk(folder):
    for file in files:
        # os.walk() yields bare file names; join them with the directory being walked.
        with codecs.open(os.path.join(subdir, file), mode='rU', encoding='utf-16-be') as f:
            pass  # Your code here

score 0 · Answer 3 · answered Dec 09 '14 at 04:38

Clearly then your file is not encoded in UTF-8. Try another encoding. If you're using Windows, 'mbcs' will use the default encoding for your version of Windows.

Mark Ransom

I've used unicodecsv to encode all of the files as utf-8 and still no luck. – Jolijt Tamanaha Dec 09 '14 at 04:52
@JolijtTamanaha I've never heard of unicodecsv and my answer still stands. The error message indicates clearly that the file is not UTF-8. Have you tried my suggestion? – Mark Ransom Dec 09 '14 at 05:16
Yes, I've tried every encoding I know and get variations on the same decoding error. – Jolijt Tamanaha Dec 09 '14 at 05:19
@JolijtTamanaha that's not possible; there are some encodings that will accept every single byte value. Have you tried `'cp1250'` yet? It's not complete but seems a very likely candidate. – Mark Ransom Dec 09 '14 at 05:26
Thank you for your help but it ended up working with UTF-8 through unicodecsv after all – Jolijt Tamanaha Dec 09 '14 at 05:54
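
Two points from this thread can be checked directly: that other encodings may decode the byte UTF-8 rejects, and that some encodings accept every byte value. The sketch below is illustrative only. The sample bytes and the candidate list are assumptions (`mac_roman` is included only because the paths in the question look like macOS), not a diagnosis of the asker's files.

```python
# Hypothetical sample line containing the byte 0x87 from the traceback.
sample = b"caf\x87,1900"

# Candidate encodings are a guess; reorder or extend them as needed.
candidates = ["utf-8", "mac_roman", "cp1252", "cp1250", "latin-1"]


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


decoded = {name: try_decode(sample, name) for name in candidates}

# Every possible byte value, to see which codecs can never raise.
all_bytes = bytes(bytearray(range(256)))
accepts_all = {name: try_decode(all_bytes, name) is not None for name in candidates}

for name in candidates:
    print(name, repr(decoded[name]), accepts_all[name])
```

A decode that succeeds is not proof that the encoding is right. `latin-1` never raises because it maps every byte to a character, so the decoded text still has to be inspected to see whether it reads sensibly.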
