
Here is a simple example:

test = {'location': '北京', 'country': '中国'}  # the values are Chinese.  

In file test.log:

{'location': '北京', 'country': '中国'} 

In Python 2.7.8, when I need to output data, I use the str() function.

file_out = open('test.log', 'w')
file_out.write(str(test))
file_out.close()

str() does not work when the dict contains non-ASCII characters. I know that the default encoding in Python 2 is ASCII, which does not support Chinese.

My question is: how can I write a dict to a file? Someone mentioned the json package to me, but I do not know how to use it.

tripleee
rayimpr
  • What do you mean "str() method does not work"? – John Zwinck Jan 07 '15 at 07:42
  • str() method will encode test with ASCII. The output will be something like this: `{'country': '\xe4\xb8\xad\xe5\x9b\xbd', 'location': '\xe5\x8c\x97\xe4\xba\xac'}` instead of `{'location': '北京', 'country': '中国'}`. – rayimpr Jan 07 '15 at 07:53
  • I have a similar problem within a different context, I have problems in printing the dictionary itself on HttpResponse. [Here is my problem and solution](http://stackoverflow.com/questions/10883399/unable-to-encode-decode-pprint-output) . Hope it helps. – Mp0int Jan 07 '15 at 09:06

2 Answers


Here is what you want.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
ori_test = {'location': '北京', 'country': '中国'}
test = dict([(unicode(k, "utf-8"), unicode(v, "utf-8")) for k, v in ori_test.items()])

my_dict = json.dumps(test, ensure_ascii=False).encode('utf8')
print my_dict
# then write my_dict to the local file as you want

And this link could be helpful for you.
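The last step, writing the serialized dict to a file, might look roughly like this. It is a sketch rather than a definitive implementation: it uses `io.open` with an explicit encoding and `u''` literals so that it is intended to run unchanged on Python 2.7 and Python 3, and `log_path` is a hypothetical location in a temp directory standing in for the real `test.log`.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

test = {u'location': u'北京', u'country': u'中国'}

# ensure_ascii=False keeps the Chinese characters instead of \uXXXX escapes.
text = json.dumps(test, ensure_ascii=False)

# Hypothetical output path; a temp directory stands in for the real log location.
log_path = os.path.join(tempfile.mkdtemp(), 'test.log')

# io.open encodes the Unicode text to UTF-8 bytes as it is written.
with io.open(log_path, 'w', encoding='utf-8') as file_out:
    file_out.write(text)

# Read the file back, decoding from UTF-8, to check what was stored.
with io.open(log_path, 'r', encoding='utf-8') as file_in:
    written = file_in.read()
```

Opening the file in text mode with `encoding='utf-8'` means the encoding happens once, at the point of writing, so no separate `.encode('utf8')` call is needed.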

Stephen Lin
  • Here in my situation, test is dict type, not string type. I tried to run your code by removing u and the double quotes, and then error occurs. 'ascii' codec can't decode byte 0xe4 in position 13: ordinal not in range(128). – rayimpr Jan 07 '15 at 08:02
  • It should be `test = {u'location': u'北京', u'country': u'中国'}` – tripleee Jan 07 '15 at 08:05
  • This works. But when the value of location is generated dynamically, how do I add u before the value to declare it as Unicode? @tripleee – rayimpr Jan 07 '15 at 08:11
  • Whatever generates the string should convert it to a Unicode string before returning it. Sounds like maybe you should read http://nedbatchelder.com/text/unipain.html – tripleee Jan 07 '15 at 08:15
  • @Ray I think you have to convert it one by one. – Stephen Lin Jan 07 '15 at 08:18
  • It does what you apparently want it to do but it's not a very comprehensive answer for this particular question. The code which generates the values needs to return Unicode strings (and generally, you should make all of your code use Unicode strings). Then the JSON part will trivially work. – tripleee Jan 07 '15 at 08:36
  • @m170897017 I tried to run your code on both windows and Linux platform. but it fails at this line: `test = dict([(unicode(k), unicode(v)) for k, v in ori_test.items()])`. Error message is "'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)". :( – rayimpr Jan 07 '15 at 08:37
  • @m170897017 What about this method: `import json ori_test = {'location': '北京', 'country': '中国'} my_dict = json.dumps(ori_test, ensure_ascii=False, encoding='utf-8') print my_dict` This works well! – rayimpr Jan 07 '15 at 09:13
  • @Ray After checking the source code of json.dumps, I found that key point here is ensure_ascii=False. In your method, without encoding='utf-8' it will still work, since default value of encoding is 'utf-8'. Anyway, this is a simpler solution. – Stephen Lin Jan 07 '15 at 09:23

The code which populates this structure should produce Unicode strings (Python 2 u"..." strings), not byte strings (Python 2 "..." strings). See http://nedbatchelder.com/text/unipain.html for a good introduction to the pertinent differences between these two data types.

Building on (an earlier version of) m170897017's answer:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
test = {u'location': u'北京', u'country': u'中国'}
my_json = json.dumps(test, ensure_ascii=False).encode('utf8')
print my_json

If you have code which programmatically populates the location field, make it populate it with a Unicode string. For example, if you read UTF-8 data from somewhere, decode() it before putting it there.

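def update_location():
    # Stand-in for UTF-8 bytes read from an external source.
    location = '北京'
    # Decode to a Unicode string before storing it.
    return location.decode('utf-8')

test['location'] = update_location()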
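For values that arrive dynamically, the same idea can be wrapped in a small helper. The sketch below is only an illustration: `decode_value` is a hypothetical name, not part of any library, and it assumes the incoming bytes are UTF-8. The byte sequence is the one shown in the question's comments for 北京.

```python
# -*- coding: utf-8 -*-
import json


def decode_value(raw, encoding='utf-8'):
    """Return a Unicode string, decoding raw only if it is a byte string."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Bytes as they might arrive from a file or socket: UTF-8 for 北京.
raw_location = b'\xe5\x8c\x97\xe4\xba\xac'

test = {u'country': u'中国'}
# Decode at the point where the value enters the structure.
test[u'location'] = decode_value(raw_location)

my_json = json.dumps(test, ensure_ascii=False)
```

Values that are already Unicode pass through untouched, so the helper can be applied to every incoming value without tracking which ones still need decoding.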

You could use other serialization formats besides JSON, including the str() representation of the Python structure, but JSON is standard, well-defined, and well-documented. JSON strings are sequences of Unicode characters, and JSON text is normally encoded as UTF-8, so it works trivially for non-English strings.
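To illustrate why JSON is a safe choice here, this short sketch compares the default escaped output with the `ensure_ascii=False` output and loads both back. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
import json

test = {u'location': u'北京', u'country': u'中国'}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(test)

# ensure_ascii=False: the characters are kept as they are, which is easier to read.
readable = json.dumps(test, ensure_ascii=False)

# Both forms are valid JSON and decode to the same dict.
restored_escaped = json.loads(escaped)
restored_readable = json.loads(readable)
```

The two outputs differ only in how readable they are to a human; a JSON parser treats them as the same data.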

Python 2 works internally with either byte strings or Unicode strings, but in this scenario Unicode strings are emphatically recommended, and they will be the only sensible choice if/when you move to Python 3. Convert everything to Unicode as soon as you can, and convert back to an external representation (e.g. UTF-8) only when you have to.
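A minimal sketch of that decode-early, encode-late pattern, assuming the external data is UTF-8 (the variable names are made up for the example):

```python
# -*- coding: utf-8 -*-

# External representation: UTF-8 bytes for 北京.
incoming = b'\xe5\x8c\x97\xe4\xba\xac'

# Decode once, at the input boundary.
location = incoming.decode('utf-8')

# Inside the program, lengths and slices count characters, not bytes.
char_count = len(location)
byte_count = len(incoming)

# Encode once, at the output boundary.
outgoing = location.encode('utf-8')
```

Here `char_count` is 2 while `byte_count` is 6, which is the kind of mismatch that causes trouble when byte strings are handled as if they were text.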

tripleee
  • Ack contribution by @m170897017 – tripleee Jan 07 '15 at 08:41
  • Thanks for your useful link. It helps me understand Unicode and byte strings better. I will read it thoroughly. But in my situation, I want to use UTF-8 encoding so the values are easy to identify. Unicode is hard to read. – rayimpr Jan 07 '15 at 08:57
  • @tripleee Your last sentence is valuable and helpful to avoid many pitfalls! Thanks! – rayimpr Jan 07 '15 at 09:06