
There are many questions on Python and unicode/str handling, but none of the answers work for me.

First, a file is opened with csv.DictReader and each row is appended to a list. Then a value from the row's dict is passed to a function that should convert it to unicode.

Step One is getting the data

f = csv.DictReader(open(filename,"r"))
data = []
for row in f:
    data.append(row)

Step Two is getting a string value from the dict and stripping the accents (code adapted from other posts)

s = data[i].get('Name')
s = strip_accents(s)

def strip_accents(s):
    try: s = unicode(s)
    except: s = s.encode('utf-8')
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s

I use try/except because some strings have accents and others don't. What I cannot figure out is why unicode(s) works on a str with no accents but fails on a str that has them:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
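
The failure can be reproduced without the CSV file. In Python 2, `unicode(s)` on a `str` decodes with the default encoding (ascii), so any byte above 0x7F raises. Below is a minimal sketch, assuming the name is stored in a single-byte encoding such as cp1252; `raw` is a made-up sample value, and byte literals are used so the snippet also runs on Python 3:

```python
# Hypothetical sample: "Thomas C. Südhof" as it might be stored in a cp1252 file.
raw = b"Thomas C. S\xfcdhof"

# unicode(raw) in Python 2 behaves like raw.decode("ascii").
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_position = exc.start  # index of the first undecodable byte

# Naming the actual encoding succeeds and yields unicode text.
text = raw.decode("cp1252")
```

With this sample the ascii decode fails at the 0xfc byte, while the explicit cp1252 decode returns the name as unicode text.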

I have seen posts on this, but the answers do not work. type(s) reports <type 'str'>, so I tried to read the file as unicode:

f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))

But as soon as it starts reading the rows

data = []
for row in f:
    data.append(row)

This error occurs:

  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
  File "C:\Python27\lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
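
The byte 0xfc in this traceback is what a single-byte encoding would produce for ü, not what UTF-8 would. A small sketch comparing the two encodings of ü (U+00FC); the variable names are illustrative only:

```python
u_umlaut = u"\xfc"  # the character ü

as_utf8 = u_umlaut.encode("utf-8")      # two bytes: 0xc3 0xbc
as_cp1252 = u_umlaut.encode("cp1252")   # one byte: 0xfc

# A lone 0xfc byte is not valid UTF-8, so a strict decode rejects it.
try:
    as_cp1252.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```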

Is this error caused by the way DictReader handles unicode? How can I get around this?


More tests. As @univerio pointed out, the item causing the failures is encoded as ISO-8859-1, not UTF-8.

Modifying the open statement to:

f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))

produces a slightly different error:

  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

Using the plain open() call and modifying the body of strip_accents() as follows:

try: s = unicode(s)
except: s = s.decode("iso-8859-1").encode('utf8')
print type(s)
s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
return str(s)

prints that the type is still str, and then fails on

s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
TypeError: must be unicode, not str
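
The TypeError arises because `.decode(...).encode('utf8')` ends with a byte string again, while `unicodedata.normalize` only accepts unicode text. A hedged sketch of the difference, using a made-up sample value and byte literals so it behaves the same way on Python 2 and 3:

```python
import unicodedata

raw = b"S\xfcdhof"  # hypothetical ISO-8859-1 bytes for "Südhof"

round_tripped = raw.decode("iso-8859-1").encode("utf8")  # bytes again (UTF-8)
decoded = raw.decode("iso-8859-1")                       # unicode text

# normalize() rejects byte strings with a TypeError.
try:
    unicodedata.normalize("NFKD", round_tripped)
    accepted_bytes = True
except TypeError:
    accepted_bytes = False

# NFKD splits ü into "u" plus a combining diaeresis; encoding to ascii
# with errors="ignore" then drops the combining mark.
stripped = unicodedata.normalize("NFKD", decoded).encode("ascii", "ignore")
```

Stopping after the decode step, without re-encoding, is what gives `normalize` the type it expects.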

Based on the question "Python: Converting from ISO-8859-1/latin1 to UTF-8", modifying the line to

s = unicode(s.decode("iso-8859-1").encode('utf8'))

produces a different error:

except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
user-2147482637
  • It means you have bad data. Are you sure your file is UTF-8 encoded? `0xFC` is not the first byte of any valid UTF-8 sequence. (See [here](http://en.wikipedia.org/wiki/UTF-8#Description).) – univerio Aug 28 '14 at 06:06
  • @univerio it fails on this name: Thomas C. Südhof. It is from a CSV file on Windows, and I assumed it was UTF-8, but maybe that was a bad assumption. How can I check for all encodings, since there are different names with different characters? – user-2147482637 Aug 28 '14 at 06:09
  • What I meant was that when you open the file with `encoding='utf-8'` python cannot decode the first byte of the file because it is not a valid first byte in a UTF-8 sequence. It sounds like your file might be encoded in Windows 1252, since `0xFC` is the byte for ü in that encoding. Try opening it with `encoding="cp1252"`. – univerio Aug 28 '14 at 06:16
  • @univerio That fails at a different point with `File "C:\Python27\lib\csv.py", line 104, in next row = self.reader.next() UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)` – user-2147482637 Aug 28 '14 at 06:19
  • is there a way to ask what the encoding is per line? – user-2147482637 Aug 28 '14 at 06:19
  • The error says you're still using the ascii codec. Try `codecs.open(filename, "r", encoding='cp1252')`. – univerio Aug 28 '14 at 06:21
  • @univerio yes that was what i did, and received that error, it was slightly different as it did not reach `line = self.readline()` – user-2147482637 Aug 28 '14 at 06:22
  • possible duplicate of [python module like csv-DictReader with full utf8 support](http://stackoverflow.com/questions/5478659/python-module-like-csv-dictreader-with-full-utf8-support) – tripleee Aug 28 '14 at 06:44
  • It makes no sense for different rows to have a different encoding, you simply need to clean up the file by hand if that is the case. There is no way to deduce the encoding unless you know what the data is supposed to be, in which case the only thing which makes sense is to encode it correctly in the first place. – tripleee Aug 28 '14 at 06:46
  • @tripleee in context, the names come from different sources, so they have different encodings, but I had wrongly assumed they were all UTF-8. univerio solved it for this instance, and I have to modify per case, which is easier than sorting through each name to fix it by hand – user-2147482637 Aug 28 '14 at 06:49
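
For values that come from mixed sources, as discussed in the comments, one heuristic is to try a strict UTF-8 decode first and fall back to cp1252. This is only a sketch: `decode_best_effort` is a hypothetical helper, the order of encodings is an assumption, and a guess like this can silently pick the wrong encoding, so the caution above about fixing the data at the source still applies.

```python
def decode_best_effort(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for a byte string, trying each encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never raises.
    return raw.decode("latin-1"), "latin-1"


from_utf8 = decode_best_effort(b"S\xc3\xbcdhof")   # valid UTF-8
from_cp1252 = decode_best_effort(b"S\xfcdhof")     # invalid UTF-8, valid cp1252
```

UTF-8 goes first because it is strict: text in a single-byte encoding that contains accented characters rarely happens to be valid UTF-8, whereas cp1252 accepts almost any byte sequence.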

2 Answers


I think this should work:

import unicodedata

def strip_accents(s):
    # Decode from cp1252 explicitly instead of relying on the implicit ascii decoding done by unicode()
    s = s.decode("cp1252")
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s

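Opening the file with the correct encoding didn't work because the Python 2 csv module does not support unicode input. DictReader expects byte strings, so the decoded lines seem to be implicitly re-encoded with the ascii codec, which is the UnicodeEncodeError you saw.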
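
Since DictReader in Python 2 hands back byte strings, one option is to decode each value after the row has been read. A sketch under assumptions: `strip_row` is a hypothetical helper, every value in the row is a byte string in the same encoding, and `sample_row` stands in for one row that DictReader would yield:

```python
import unicodedata


def strip_accents(raw, encoding="cp1252"):
    """Decode a byte string, then drop accents and any non-ASCII leftovers."""
    text = raw.decode(encoding)
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


def strip_row(row, encoding="cp1252"):
    """Apply strip_accents to every value of one row dict (hypothetical helper)."""
    return dict((key, strip_accents(value, encoding)) for key, value in row.items())


# Stand-in for one row from csv.DictReader on Python 2 (values are byte strings).
sample_row = {"Name": b"Thomas C. S\xfcdhof"}
cleaned = strip_row(sample_row)
```

The result is an ASCII byte string per field. Characters with no ASCII decomposition (for example ß) are dropped rather than transliterated.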

univerio

See @Duncan's answer to "UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)", which suggests inspecting the value with repr:

print repr(ch)

Example:

string = 'Ka\u011f KO\u011e52 \u0131 \u0130\u00f6\u00d6 David \u00fc K\u00dc\u015f\u015e \u00e7 \u00c7'

print (repr(string))

It prints (on Python 3, where the literal is a unicode string):

'Kağ KOĞ52 ı İöÖ David ü KÜşŞ ç Ç'
Mark K