0

Please help. I've been struggling with this for a while and have run into problem after problem. I'm just trying to write a loop that opens every CSV file in a folder. Here's the loop:

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'

for file in os.listdir (folder):
    with codecs.open(file, mode='rU', encoding='utf-8') as f:
        m=min(int(line[1]) for line in csv.reader(f))
        f.seek(0)
        for line in csv.reader(f):
            if int(line[1])==m:
                print line

Here's the error:

Traceback (most recent call last):
  File "findfirsttrigram.py", line 11, in <module>
    m=min(int(line[1]) for line in csv.reader(f))
  File "findfirsttrigram.py", line 11, in <genexpr>
    m=min(int(line[1]) for line in csv.reader(f))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 0: invalid start byte

I got here because I first had a "Null Byte" error, which I solved with this post: "Line contains NULL byte" in CSV reader (Python)

Then I got an integer error, which I solved with this post: "an integer is required" when open()'ing a file as utf-8?

Then I got an error that said 'UnicodeException: UTF-16 stream does not start with BOM', which I solved with this post: utf-16 file seeking in python. how?

Then I realized that the csv module requires UTF-8, so here I am.

But I've finally hit the limit of the existing questions, and I can't figure out what is going on. Please help.

Jolijt Tamanaha
  • Have you considered using one of the encoding error handlers via the `errors` argument? https://docs.python.org/2.7/library/codecs.html#codecs.replace_errors – wwii Dec 09 '14 at 04:05
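
To illustrate the comment above, here is a minimal sketch of how the `errors` argument changes the decoding of the byte named in the traceback. It is written in Python 3 syntax (an assumption, since the question runs Python 2.7), and the sample bytes are made up for illustration rather than taken from the asker's files.

```python
# Hypothetical sample: byte 0x87 followed by ASCII text.
raw = b"\x87abc,1999"

# Strict decoding (the default) raises, as in the traceback.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for the undecodable byte; 'ignore' drops it.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed)
print(repr(replaced))
print(repr(ignored))
```

Both handlers hide the symptom rather than identify the file's real encoding, so any non-ASCII text in the data may still come out wrong.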

3 Answers

1

I'm not sure why, but this ultimately worked:

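import os
import unicodecsv

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams3/'

for file in os.listdir(folder):
    # os.listdir returns bare file names, so join each one with the folder path.
    with open(os.path.join(folder, file), mode='rU') as f:
        try:
            # Decode as UTF-8, replacing any bytes that are not valid UTF-8.
            m = min(int(line[1]) for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'))
        except Exception:
            # Skip files that cannot be read or parsed.
            print "one no work"
            continue
        # Rewind and read the file again to print the rows holding the minimum.
        f.seek(0)
        for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'):
            if int(line[1]) == m:
                print line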
Jolijt Tamanaha
  • This works because of the combination of encoding='utf-8' and errors='replace': the reader decodes the file as UTF-8 and replaces any bytes that cannot be decoded. – shao.lo May 27 '17 at 00:58
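
For readers on Python 3, a rough equivalent of the accepted approach can be sketched with the built-in csv module, which reads text there and so does not need unicodecsv. This is a sketch under assumptions: the function names `rows_with_min_value` and `scan_folder` are hypothetical, and rows with fewer than two columns are skipped, which the original loop does not do.

```python
import csv
import os


def rows_with_min_value(path):
    """Return the rows whose second column holds the smallest integer."""
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
    with open(path, encoding="utf-8", errors="replace", newline="") as f:
        rows = [row for row in csv.reader(f) if len(row) > 1]
    if not rows:
        return []
    smallest = min(int(row[1]) for row in rows)
    return [row for row in rows if int(row[1]) == smallest]


def scan_folder(folder):
    """Print the minimum rows of every file in folder (hypothetical helper)."""
    for name in sorted(os.listdir(folder)):
        # os.listdir returns bare names, so build the full path first.
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        try:
            for row in rows_with_min_value(path):
                print(name, row)
        except (ValueError, csv.Error):
            # The second column is not numeric, or the CSV itself is malformed.
            print("skipped", name)
```

Reading the rows into a list avoids the `f.seek(0)` second pass, at the cost of holding each file in memory.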
0

Perhaps try using os.walk along with for file in files?

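import codecs
import os

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for subdir, dirs, files in os.walk(folder):
    for file in files:
        # os.walk yields bare file names, so join each one with its directory.
        with codecs.open(os.path.join(subdir, file), mode='rU', encoding='utf-16-be') as f:
            pass  # Your code here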
Torkoal
0

Clearly, then, your file is not encoded in UTF-8. Try another encoding. If you're using Windows, 'mbcs' will use the default encoding for your version of Windows.
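
One way to "try another encoding" systematically is to attempt a list of candidates in order and keep the first that decodes without error. The sketch below assumes Python 3; the function name `first_working_encoding` and the candidate list are hypothetical, not something the asker's files are known to need.

```python
def first_working_encoding(data, candidates):
    """Return the first encoding in candidates that decodes data, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical candidate list. Order matters: latin-1 accepts any byte
# sequence, so it only makes sense as the last resort.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]

print(first_working_encoding(b"\x87abc", CANDIDATES))
```

A decode that succeeds does not prove the encoding is correct, so it is still worth inspecting the decoded text by eye before trusting the result.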

Mark Ransom