While reading a JSON file in Python, some additional Unicode symbols appear in the data

Question

I need to read data from a JSON file so that I can send it in a POST request. Unfortunately, when I read the file, some unexplained Unicode symbols appear at the beginning:

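import os

path = r'.\jsons_updated'  # raw string, so the backslash is taken literally
newpath = os.path.join(path, 'Totem Plus eT 00078-20140224_060406.ord.txt')
file = open(newpath, 'r')  # no encoding given, so the system default is used
#data = json.dumps(file.read())
data = file.read()
print('data= ', data)
file.close()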

Data in the file starts with this:

{"PriceTableHash": [{"Hash": ...

I get the result:

data=  п»ї{"PriceTableHash": [{"Hash": ...

or, in the case of data = json.dumps(file.read()):

data=  "\u043f\u00bb\u0457{\"PriceTableHash\": [{

So my request can't process this data. The odd symbols are the same for all the files I have.

UPD: If I copy the data manually into a new JSON or txt file, the problem disappears. But I have about 2.5k files, so that's not an option =)

score 1 · Accepted Answer · edited May 23 '17 at 10:25

The call open(newpath, 'r') opens the file with your system's default encoding, whatever that happens to be. Your files contain UTF-8 encoded data, but instead of decoding them with a UTF-8 decoder, Python uses a legacy code page (Cp-1250 or similar; the п»ї in your output is what the bytes EF BB BF look like when decoded as cp1251), so the text comes out mangled.

Use codecs.open() instead and specify the correct encoding of the data (i.e. the one that was used when the files were written). In Python 3, the built-in open() accepts an encoding argument as well.
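As a minimal sketch of reading with an explicit encoding, assuming Python 3 and assuming the files are UTF-8 with a leading BOM: the helper name `load_json_file` and the demo file content are made up for illustration, and `utf-8-sig` is the standard codec that decodes UTF-8 and skips a BOM if one is present.

```python
import json
import os
import tempfile


def load_json_file(path, encoding="utf-8-sig"):
    """Parse a JSON file, skipping a leading UTF-8 BOM if there is one.

    Hypothetical helper, not part of the original question.
    """
    with open(path, "r", encoding=encoding) as handle:
        return json.load(handle)


# Demo: write a file that starts with the UTF-8 BOM (EF BB BF),
# followed by a small made-up JSON document, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ord.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"\xef\xbb\xbf" + b'{"PriceTableHash": [{"Hash": "abc"}]}')
    data = load_json_file(demo_path)
```

Passing the encoding explicitly makes the result independent of the machine's locale, which is usually what you want for files produced by another system.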

The odd bytes you get look like a byte order mark (BOM). You may want to change the code that writes those files so that it omits the BOM and sends you pure UTF-8. See also Reading Unicode file data with BOM chars in Python.
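To check the BOM theory, here is a small sketch that decodes the same bytes three ways. The JSON payload is made up, and cp1251 is an assumption about the asker's system default, chosen because it reproduces the exact symbols shown in the question.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
raw = codecs.BOM_UTF8 + b'{"PriceTableHash": []}'

# Decoded with a Cyrillic legacy code page, the BOM turns into three
# visible characters (U+043F, U+00BB, U+0457).
mangled = raw.decode("cp1251")

# Plain UTF-8 decodes correctly but keeps the BOM as U+FEFF at the start.
with_bom = raw.decode("utf-8")

# utf-8-sig decodes UTF-8 and drops a leading BOM.
clean = raw.decode("utf-8-sig")
```

The escapes `\u043f\u00bb\u0457` in the json.dumps output from the question are exactly those three mis-decoded characters.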

answered Aug 21 '14 at 08:10

Aaron Digulla

Thanks! [Reading Unicode file data with BOM chars in Python](http://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python) helped. If I open my files like this: `file = codecs.open(newpath, 'r', 'utf-8-sig')`, everything is OK. – Alex Aug 21 '14 at 09:30
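Since the question mentions about 2.5k files, the same fix can be applied to a whole folder. This is only a sketch, assuming Python 3: the generator name `iter_json_payloads`, the `*.ord.txt` pattern, and the demo files are assumptions for illustration, not something taken from the original code.

```python
import json
import tempfile
from pathlib import Path


def iter_json_payloads(folder, pattern="*.ord.txt"):
    """Yield (file name, parsed JSON) for every matching file in folder.

    Hypothetical helper; utf-8-sig handles files with or without a BOM.
    """
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8-sig")
        yield path.name, json.loads(text)


# Demo with two made-up files: one with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.ord.txt").write_bytes(b"\xef\xbb\xbf" + b'{"n": 1}')
    (Path(tmp) / "b.ord.txt").write_bytes(b'{"n": 2}')
    payloads = dict(iter_json_payloads(tmp))
```

Each parsed payload can then be serialized again with json.dumps before it goes into the POST request, and the BOM will not reappear.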
