I have some non-ASCII characters in a JSON file. The file contains a list of strings such as SMΛN. When I read the JSON file and print that list:

import json

with open("strings.json") as f:
    t = json.load(f)

print(t)

I get the non-ASCII string back as 'SMÎ›N\n'. How can I decode it as UTF-8, or otherwise get the Unicode characters to print properly? I tried this:

with open ("strings.json",encoding = 'utf-8') as f:
    t = json.load(f).encode('utf-8').decode('utf-8')

But the output is still the same.

David
GLHF
  • What does `locale` tell? It is one problem to have the unicode encoded as bytes in the file, and the other decoding on read in the reading environment and then again encoding for output (which should depend on locale with print) – Dilettant Jul 07 '16 at 10:20
  • It could also be that the file you are reading is not UTF-8. Are you sure that is the correct encoding? – syntonym Jul 07 '16 at 10:21
  • @syntonym I took the data from a .txt file and then `json.dump()`, then read it from json as above. The file is encoded utf-8 I'm sure about that – GLHF Jul 07 '16 at 10:22
  • @David Yes it's Windows-7 – GLHF Jul 07 '16 at 10:29
  • What happens if you open the file as binary? (so that the decoding is done by the json library) – RemcoGerlich Jul 07 '16 at 10:31
  • FWIW, `'SMΛN'.encode('utf8').decode('cp1252')` results in `'SMÎ›N'`; and `'SMÎ›N'.encode('cp1252').decode('utf8')` results in `'SMΛN'`. Also, `'SMΛN'.encode('utf8')` results in `b'SM\xce\x9bN'` – PM 2Ring Jul 07 '16 at 10:31
  • @PM2Ring Well you're right. Could you post your comment as answer please – GLHF Jul 07 '16 at 10:36
  • @GLHF Sorry, I was just noting the relationship between those two strings. I don't have an actual answer to the cause of your problem, although I assume it has something to do with your console using codepage 1252. Does David's answer work for you? I can't test it because I don't use Windows. – PM 2Ring Jul 07 '16 at 10:40
  • What do you get from `print(b'SM\xce\x9bN'.decode('utf8'))` in the terminal? If the terminal's encoding is UTF-8, you should get `SMΛN`, but if the encoding is codepage 1252 you will get `SMÎ›N`. So the proper solution is to set the correct encoding in the terminal. That's very easy to do in my terminal, but I have no idea of how to do it on a Windows system. – PM 2Ring Jul 07 '16 at 10:45

1 Answer

In Python 3, `open` defaults to using the encoding returned by `locale.getpreferredencoding()`, which on US-localized Windows is cp1252.
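To see which encoding `open` would likely pick on a given machine, a quick check along these lines may help. This is only a sketch: the printed value depends on the operating system, locale settings, and Python version, so no output is shown.

```python
import locale

# The locale's preferred encoding, which open() typically uses when no
# encoding= argument is passed (Python's UTF-8 mode changes this).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints something other than a UTF-8 name, reading a UTF-8 file without an explicit `encoding` argument can garble non-ASCII characters.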

Open the file with UTF-8 encoding instead:

#coding:utf8
import json
L = ['SMΛN']

# Generate an example file in UTF-8.
with open('out.json','w',encoding='utf8') as f:
    json.dump(L,f,ensure_ascii=False)

# open with default encoding
with open('out.json') as f:
    L = json.load(f)

print(L)

# open with correct encoding
with open('out.json', encoding='utf8') as f:
    L = json.load(f)

print(L)

Output:

['SMÎ›N']
['SMΛN']
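The mismatch can also be reproduced on any platform by forcing cp1252 explicitly instead of relying on the system default. The following is a minimal sketch; the temporary file path and the variable names `wrong`, `right`, and `repaired` are illustrative, and the string is written with a `\u039b` escape so the snippet does not depend on the source file's own encoding.

```python
import json
import os
import tempfile

data = ['SM\u039bN']  # 'SMΛN', with Λ written as an escape

# Write the list as UTF-8 JSON to a temporary file (illustrative path).
path = os.path.join(tempfile.mkdtemp(), 'out.json')
with open(path, 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False)

# Reading with cp1252 mimics the default on US-localized Windows.
with open(path, encoding='cp1252') as f:
    wrong = json.load(f)

# Reading with the codec the file was written in gives the original text.
with open(path, encoding='utf8') as f:
    right = json.load(f)

# ascii() shows the code points without depending on the console encoding.
print(ascii(wrong))
print(ascii(right))

# Text that was already mis-decoded can often be recovered by reversing
# the wrong step, provided every character maps back to a cp1252 byte.
repaired = [s.encode('cp1252').decode('utf8') for s in wrong]
print(ascii(repaired))
```

The repair step is a workaround for data that has already been read incorrectly; passing `encoding='utf8'` to `open` in the first place avoids the problem.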

Note that the print will only work correctly if your IDE (and font) supports Unicode characters being printed.
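If the console cannot display the character at all, one possible fallback (not part of the fix above, just a way to inspect the data) is to print an ASCII-only representation, so that nothing depends on the output encoding or font:

```python
import json

L = ['SM\u039bN']  # 'SMΛN', with Λ written as an escape

# ascii() escapes every non-ASCII character in the repr.
print(ascii(L))

# json.dumps() escapes non-ASCII characters by default (ensure_ascii=True).
print(json.dumps(L))
```

Both lines show the lambda as the escape `\u039b`, which confirms the string was decoded correctly even when the glyph itself cannot be rendered.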

P.S. Using `chcp 65001` to view the output on the Windows console, as suggested in another answer, is broken. Note the extra `]` line (this was on Windows 7):

C:\>chcp 65001
Active code page: 65001

C:\>test
['SMÎ›N']
]
['SMΛN']
Mark Tolonen