
This is a common problem, and I have tried to respect the following rules (but probably applied them wrongly):

  • decode inputs
  • encode outputs
  • work in utf8 in between

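Those three rules, sketched on a plain byte string (the sample value is made up, not taken from the real data):

```python
# -*- coding: utf-8 -*-
# Minimal illustration of the decode/process/encode pattern.
raw = b"caf\xc3\xa9"            # bytes arriving from outside (file, socket, ...)
text = raw.decode("utf-8")       # 1. decode inputs
text = text.upper()              # 2. work on unicode text in between
out = text.encode("utf-8")       # 3. encode outputs once, at the boundary
```

The point is that `.encode()` appears exactly once, at the very end.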
Here is an excerpt of my code:

#!/usr/bin/env python
# encoding: utf-8
import io
import json

m = dict()
with io.open('test.json', 'r', encoding="utf-8") as f:
    m = json.load(f)
with io.open("test.csv", 'w', encoding="utf-8") as ficS:
    line = list()
    for i in m['v']:
        v = m['v'][i]
        line.append(v['label'].replace("\n", " - "))
    ficS.write(';'.join(line).encode('utf-8') + '\n')

Without `.encode('utf-8')` it works, but the accented letters in the resulting file come out garbled. With it, I get the following error message:

__main__.py: UnicodeDecodeError('ascii', 'blabla\xc3\xa9blabla', 31, 32, 'ordinal not in range(128)')

Here and here, it is said:

You are encoding to UTF-8, then re-encoding to UTF-8. Python can only do this if it first decodes again to Unicode, but it has to use the default ASCII codec. Don't keep encoding; leave encoding to UTF-8 to the last possible moment instead. Concatenate Unicode values instead.
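That implicit ASCII decode can be reproduced explicitly. The byte string below is a hypothetical stand-in matching the one in the traceback; Python 2 performs the equivalent of `raw.decode("ascii")` silently whenever a byte string is mixed with a unicode string (e.g. `encoded_bytes + u'\n'`):

```python
# The bytes that `.encode('utf-8')` produced from u"blabla\u00e9blabla":
raw = u"blabla\u00e9blabla".encode("utf-8")   # b'blabla\xc3\xa9blabla'

# Done explicitly, the implicit coercion fails the same way as in the question:
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
```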

Any idea please?

lalebarde
  • `line.append(v['label'].replace(u"\n", u" - "))` - once you have decoded, make sure you always use unicode strings when mutating the unicode string. Otherwise Python 2 may try to coerce between the two types. – snakecharmerb Jul 29 '19 at 16:22
  • I have already tried that, also in `ficS.write(u';'.join(line).encode('utf-8') + u'\n')`, but without success – lalebarde Jul 29 '19 at 16:38
  • Don't manually `.encode('utf8')`. Leave this work to the `ficS` filehandle. So the last line should read: `ficS.write(u';'.join(line) + u'\n')` – lenz Jul 29 '19 at 19:09
  • If the accentuated letters look distorted in the resulting file, then this is most probably because MS Excel (or whatever tool you use to open) doesn't recognise UTF-8. In that case, try a different encoding for the output file, eg. `"utf-8-sig"` or `"utf16"`. – lenz Jul 29 '19 at 19:11
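Putting the comments together, a corrected sketch of the excerpt (the JSON sample data here is made up to make the snippet self-contained; the real `test.json` differs) drops the manual `.encode` entirely and lets the `io.open` handle encode on write:

```python
# -*- coding: utf-8 -*-
import io
import json

# Hypothetical sample data standing in for the real test.json.
data = {u"v": {u"u": {u"label": u"blabla\u00e9blabla\nsecond line"}}}
with io.open("test.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(data, ensure_ascii=False))

with io.open("test.json", "r", encoding="utf-8") as f:
    m = json.load(f)

# Build unicode strings only; the file handle encodes them on the way out.
with io.open("test.csv", "w", encoding="utf-8") as ficS:
    line = []
    for key in m[u"v"]:
        v = m[u"v"][key]
        line.append(v[u"label"].replace(u"\n", u" - "))
    ficS.write(u";".join(line) + u"\n")
```

If the output still looks wrong in a spreadsheet tool, that is the separate `"utf-8-sig"`/`"utf16"` issue lenz mentions, not an encoding bug in the code.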

0 Answers