I have a small Python script that loops through a text file of traditional and simplified Chinese characters, along with their associated pinyin and English translations, and stores them in a JSON object.
Here's the script -
import json

# resultant dictionary (renamed from `dict` so it doesn't shadow the built-in)
entries = {}
# fields in the sample file
fields = ['traditional', 'simplified', 'pinyin', 'english']

with open('cedict.txt', encoding='utf8') as fh:
    # looping logic
    ...

# creating the json file
with open('cedict.json', 'w', encoding='utf8') as new_file:
    json.dump(entries, new_file, indent=4)
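The looping logic isn't shown above, so here is a minimal sketch of one way it could work, assuming the standard CC-CEDICT line format `traditional simplified [pinyin] /english/`; the `parse_line` helper, the sample line, and the `wordN` key scheme are my illustration, not the actual code:

import json

fields = ['traditional', 'simplified', 'pinyin', 'english']

def parse_line(line):
    # skip the comment lines in the CC-CEDICT header
    if line.startswith('#'):
        return None
    # split off the traditional form, the simplified form,
    # then the bracketed pinyin, leaving the english glosses
    traditional, _, rest = line.partition(' ')
    simplified, _, rest = rest.partition(' ')
    pinyin, _, english = rest.partition(']')
    return dict(zip(fields, [traditional, simplified,
                             pinyin + ']', english.strip()]))

entries = {}
with open('cedict.txt', encoding='utf8') as fh:
    for i, line in enumerate(fh):
        parsed = parse_line(line)
        if parsed is not None:
            entries['word' + str(i)] = parsed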
Here's a small snippet of the JSON object -
"word93428": {
"traditional": "\u86e7",
"simplified": "\u86e7",
"pinyin": "[wang3]",
"english": "/old"
}
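For context, those `\uXXXX` sequences are ordinary JSON escapes: `json.dump` escapes every non-ASCII character this way when its default `ensure_ascii=True` is in effect, and passing `ensure_ascii=False` writes the characters literally instead. A quick check (the one-entry dictionary is just illustrative):

import json

# default ensure_ascii=True escapes non-ASCII characters as \uXXXX
print(json.dumps({'traditional': '\u86e7'}))
# ensure_ascii=False emits the Chinese character itself
print(json.dumps({'traditional': '\u86e7'}, ensure_ascii=False))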
The text file is encoded in utf8, which seems to work fine for Latin-based characters but not for the Chinese ones.
I've played around with other character encodings, and they all yield different errors, so the easier solution seems to be to loop through the JSON object and decode the Chinese characters so they look the way they should.
This is what I'm stuck on.
I've tested the decode() function on one of the encoded Chinese characters, and it makes the character appear in its original form.
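The decode() call isn't shown, but assuming it was something along these lines (the `unicode_escape` codec is my guess at what was used), decoding a single escape sequence looks like:

# the six-character escape sequence, as literal text
escaped = '\\u86e7'
# interpret the \uXXXX escape to recover the original character
char = escaped.encode('ascii').decode('unicode_escape')
print(char)  # prints the Chinese character itself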
But I need to loop through an entire JSON object with thousands of translations and decode only the first two of the four key/value pairs in each entry (traditional and simplified).
How can I achieve this?