This is a common problem, and I have tried to follow the usual rules (though apparently wrongly):
- decode inputs
- encode outputs
- work in Unicode in between
Here is an excerpt of my code:
#!/usr/bin/env python
# encoding: utf-8
import io
import json

with io.open('test.json', 'r', encoding="utf-8") as f:
    m = json.load(f)
with io.open("test.csv", 'w', encoding="utf-8") as ficS:
    line = list()
    for i in m['v']:
        v = m['v']['u']
        line.append(v['label'].replace("\n", " - "))
    ficS.write(';'.join(line).encode('utf-8') + '\n')
Without the .encode('utf-8'), it works, but the file is barely readable because of the accented letters. With it, I get the following error message:
__main__.py: UnicodeDecodeError('ascii', 'blabla\xc3\xa9blabla', 31, 32, 'ordinal not in range(128)')
Any idea please?

You are encoding to UTF-8 yourself, then handing the result to a file object that also encodes to UTF-8. Python can only write your byte string after first decoding it back to Unicode, and that implicit decode uses the default ASCII codec, which fails on the accented characters. Don't keep encoding; leave the encoding to UTF-8 to the io.open() file object, at the last possible moment, and concatenate Unicode values instead.
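A minimal sketch of the corrected loop, assuming the same test.json layout as the excerpt above (Python 2 syntax; note the u'' literals, so everything stays Unicode until io.open() encodes it on write):

#!/usr/bin/env python
# encoding: utf-8
import io
import json

with io.open('test.json', 'r', encoding='utf-8') as f:
    m = json.load(f)  # json.load already returns unicode strings

with io.open('test.csv', 'w', encoding='utf-8') as ficS:
    line = list()
    for i in m['v']:
        v = m['v']['u']  # same lookup as in the question's excerpt
        line.append(v['label'].replace(u'\n', u' - '))
    # join and write unicode; io.open() encodes to UTF-8 on the way out
    ficS.write(u';'.join(line) + u'\n')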