Why is the length of json.dumps(json.loads(line)) generally greater than the length of line?
I have a huge number of JSON objects and I tried to sort out the bad ones (too few details). So I read the JSON objects from a .jsonl file and saved the good JSON objects (the ones with a lot of details) to a new file.
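My filtering loop looked roughly like this (just a sketch; is_detailed() is a hypothetical stand-in for my actual check, and the file names are made up):

import json

def is_detailed(obj):
    # hypothetical stand-in for my real "does it have enough details?" check
    return len(obj) > 5

with open("objects.jsonl", encoding="utf-8") as src, \
        open("good_objects.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        obj = json.loads(line)
        if is_detailed(obj):
            dst.write(json.dumps(obj) + "\n")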
I sorted out around 60% of the JSON objects, but the new file was only around 6% smaller than the original. That seemed strange, so I ran a test: I compared the length of each JSON object line with the length of json.dumps(json.loads(line)). The length of json.dumps(json.loads(line)) ranged from 83% to 121% of the length of line, and on average it was 109.5% of the length of line.
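The test looked roughly like this (again only a sketch, with a made-up file name):

import json

ratios = []
with open("objects.jsonl", encoding="utf-8") as src:
    for line in src:
        line = line.rstrip("\n")
        ratios.append(len(json.dumps(json.loads(line))) / len(line))

print(min(ratios), max(ratios), sum(ratios) / len(ratios))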
Why does that happen, and how can I prevent it? How can I write a filtered subset of a file with Python without the remaining lines growing by 10%?
I found an example:
import json

b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b))))
print(len(json.dumps(json.loads(b), separators=(',', ':'))))
The output is 10, 20 and 19. So the compact encoding only saves a single whitespace character, but the dump of the loaded object is twice as long as the original. When I print json.dumps(json.loads(b)) I get
{"\u00c4": "\u00df"}
It seems that json.dumps() encodes characters like Ä and ß in a way that is not very space-saving: each non-ASCII character is replaced by a six-character \uXXXX escape sequence, which is why the 10-character original grows to 20 characters. I could try to write my own dumps function with a better encoding, but I would rather save myself the time.
I just found the Stack Overflow question Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence (and the comment by Finomnis on the answer).
If I try
b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b), ensure_ascii=False)))
then I get a length of 10 in both cases. Yay :-D
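So for my original problem I just need to pass ensure_ascii=False (and optionally the compact separators) when writing the filtered file, roughly like this (same sketch as above, with the hypothetical is_detailed() filter and made-up file names):

import json

def is_detailed(obj):
    # hypothetical stand-in for my real filter, as above
    return len(obj) > 5

with open("objects.jsonl", encoding="utf-8") as src, \
        open("good_objects.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        obj = json.loads(line)
        if is_detailed(obj):
            dst.write(json.dumps(obj, ensure_ascii=False,
                                  separators=(',', ':')) + "\n")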