There are many questions about Python and unicode/strings. However, none of the answers work for me.
First, a file is opened using DictReader, then each row is appended to a list. Then a value from each dict is passed to a function that converts it to unicode.
Step one is getting the data:
    import csv

    f = csv.DictReader(open(filename, "r"))
    data = []
    for row in f:
        data.append(row)
Step two is getting a string value from the dict and replacing the accents (I found this in other posts):
    s = data[i].get('Name')
    strip_accents(s)

    def strip_accents(s):
        try: s = unicode(s)
        except: s = s.encode('utf-8')
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
        return s
I use try and except because some strings have accents and others don't. What I cannot figure out is why unicode(s) works on a str that has no accents but fails on a str that has accents:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
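A minimal sketch of what appears to be happening, assuming the Python 2.7 behavior where unicode(s) on a str implicitly decodes with the ASCII codec. The sample byte strings are made up for illustration, and the explicit .decode("ascii") call stands in for that implicit step:

```python
# Hypothetical sample values; 0xfc is "ü" in ISO-8859-1 / cp1252.
plain = b"Muller"
accented = b"M\xfcller"

def try_ascii_decode(raw):
    """Return the decoded text, or None if the bytes are not pure ASCII."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return None

plain_result = try_ascii_decode(plain)        # pure ASCII, decodes fine
accented_result = try_ascii_decode(accented)  # fails on byte 0xfc
```

Under that assumption, the accented strings fail simply because they contain bytes above 0x7f, not because of anything DictReader does.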
I have seen posts on this, but the answers do not work. When I use type(s), it says it is <type 'str'>. So I tried to read the file as unicode:
    f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))
But as soon as it starts reading the rows:
    data = []
    for row in f:
        data.append(row)
this error occurs:
      File "F:...files.py", line 9, in files
        for row in f:
      File "C:\Python27\lib\csv.py", line 104, in next
        row = self.reader.next()
      File "C:\Python27\lib\codecs.py", line 684, in next
        return self.reader.next()
      File "C:\Python27\lib\codecs.py", line 615, in next
        line = self.readline()
      File "C:\Python27\lib\codecs.py", line 530, in readline
        data = self.read(readsize, firstline=True)
      File "C:\Python27\lib\codecs.py", line 477, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
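The byte 0xfc in that message hints that the file may not be UTF-8 at all. A small check, using a made-up byte string rather than the real file contents: 0xfc can never start a UTF-8 sequence, but it is "ü" in both ISO-8859-1 and cp1252.

```python
raw = b"\xfcber"  # hypothetical cell contents, not taken from the real file

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xfc is not a valid UTF-8 start byte

# The same bytes decode cleanly under the single-byte Western encodings.
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")
```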
Is this error caused by the way DictReader handles unicode? How can I get around it?
More tests. As @univerio pointed out, one item that is causing the failures is encoded as ISO-8859-1. Modifying the open statement to:
    f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))
produces a slightly different error:
      File "F:...files.py", line 9, in files
        for row in f:
      File "C:\Python27\lib\csv.py", line 104, in next
        row = self.reader.next()
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
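The Python 2.7 csv documentation notes that the module does not support Unicode input, which would explain why a decoded stream gets pushed back through the ASCII codec here. One possible workaround, sketched under that assumption: keep the plain open() call, let DictReader parse the bytes, and decode each key and value afterwards. The helper name decode_row and the cp1252 default are hypothetical, chosen only for illustration.

```python
def decode_row(row, encoding="cp1252"):
    """Decode every byte-string key and value of a parsed CSV row."""
    decoded = {}
    for key, value in row.items():
        if isinstance(key, bytes):
            key = key.decode(encoding)
        if isinstance(value, bytes):
            value = value.decode(encoding)
        decoded[key] = value
    return decoded

# Hypothetical row, shaped like what DictReader yields from a byte file.
row = {b"Name": b"M\xfcller"}
decoded_row = decode_row(row)
```

In the reading loop this would become data.append(decode_row(row)), so everything stored in data is already unicode.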
Using the basic open statement and modifying the body of strip_accents() as follows:
    try: s = unicode(s)
    except: s = s.decode("iso-8859-1").encode('utf8')
    print type(s)
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return str(s)
prints that the type is still str and errors on:
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    TypeError: must be unicode, not str
Based on the question "Python: Converting from ISO-8859-1/latin1 to UTF-8", modifying the line to:
    s = unicode(s.decode("iso-8859-1").encode('utf8'))
produces a different error:
    except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
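The last two errors look like the same issue seen from two sides: .decode("iso-8859-1") yields a unicode object, but .encode('utf8') turns it straight back into a byte string, so normalize() rejects it and unicode() falls back to ASCII again (0xc3 is the first byte of "ü" in UTF-8). A sketch of a strip_accents() that decodes once and stays in unicode until the final ASCII step, assuming the input really is ISO-8859-1; the encoding parameter is an addition for illustration:

```python
import unicodedata

def strip_accents(s, encoding="iso-8859-1"):
    """Return s with accents removed, as ASCII-only text."""
    if isinstance(s, bytes):
        # Byte string (a Python 2 str): decode explicitly instead of
        # relying on the implicit ASCII codec.
        s = s.decode(encoding)
    # NFKD splits "ü" into "u" plus a combining diaeresis; encoding to
    # ASCII with errors="ignore" then drops the combining mark.
    ascii_bytes = unicodedata.normalize("NFKD", s).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii")

stripped = strip_accents(b"M\xfcller")  # hypothetical sample value
unchanged = strip_accents(b"Smith")
```

With this shape the try/except is no longer needed, since accented and unaccented byte strings take the same path.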