0

I'm trying to convert an xml with Hebrew characters in it, to json.

here's a piece of the xml:

     <Root><City>ירושלים</City></Root>

I've tried several methods and encodings, but I always get this unconvertable ascii code instead of Hebrew characters.:

\u05e7\u05e0\u05d9\u05d5\u05df

Here's my conversion script:

import xmltodict, json
import sys

infile = open('text.xml','r')
outfile = open('text.json','w')
xmltxt = infile.read().decode('utf_16_le')

decodedXML = xmltodict.parse(xmltxt)

jsontext = json.dumps(decodedXML, encoding='utf_8')

outfile.write(jsontext.encode('utf_8'))
infile.close()
outfile.close()

Thank you

EDIT - Solution: Solved by adding 'ensure_ascii=False' to json.dumps()

eladc
  • 120
  • 1
  • 11
  • 4
    You have valid JSON; `\uhhhh` are JSON escape sequences used to keep your JSON output ASCII-safe. If you want UTF-8 output, see [Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence](https://stackoverflow.com/q/18337407) – Martijn Pieters May 27 '15 at 20:53
  • 1
    You should not need to decode infile, the encoding should be part of the header. And you don't need to encode jsontext twice. It should already be a byte-string. The double encode is a nop, and looks like cargo-cult-programing. – deets May 27 '15 at 21:03
  • Thank you @MartijnPieters! @deets thank you for your notes ! – eladc May 27 '15 at 21:03

0 Answers0