
Why is the length of json.dumps(json.loads(line)) generally greater than the length of line?

I have a huge number of JSON objects and tried to sort out the bad ones (too few details). So I read the JSON objects from a .jsonl file and saved the good ones (with a lot of details) to a new file. I sorted out around 60% of the JSON objects, but the new file was only around 6% smaller. I found that strange and ran a test: I compared the length of each JSON object line with the length of json.dumps(json.loads(line)). The lengths of json.dumps(json.loads(line)) ranged from 83% to 121% of the length of line; on average, json.dumps(json.loads(line)) was 109.5% of the length of line.

Why does this happen, and how can I prevent it? How can I write a filtered copy of the file with Python without the remaining objects growing by about 10%? My filtering loop looks roughly like the sketch below.
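
For context, a minimal sketch of what I do (has_enough_details is a placeholder for my actual quality check):

import json

# Read the .jsonl file line by line, keep only the detailed objects,
# and write them back out. The json.dumps() call is where the output
# grows compared to the original lines.
with open('objects.jsonl', encoding='utf-8') as src, \
        open('good_objects.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        obj = json.loads(line)
        if has_enough_details(obj):  # hypothetical filter function
            dst.write(json.dumps(obj) + '\n')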

I found an example:

import json

b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b))))
print(len(json.dumps(json.loads(b), separators=(',', ':'))))

The output is 10, 20 and 19. The compact encoding only saves the single space, but the dump of the loaded object is twice as long as the original. When I print json.dumps(json.loads(b)) I get

{"\u00c4": "\u00df"}

It seems that json.dumps() does not encode characters like Ä and ß in a space-saving way. I could write my own dumps function with a more compact encoding, but I would like to avoid that.
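
As far as I understand, json.dumps() escapes every non-ASCII character to a six-character \uXXXX sequence by default, which is where the blow-up comes from:

import json

# Each non-ASCII character becomes a six-character \uXXXX escape.
print(json.dumps('Ä'))       # "\u00c4"
print(len(json.dumps('Ä')))  # 8: two quotes plus the 6-character escape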

I just found the Stack Overflow question Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence (and Finomnis's comment on the answer).

If I try

import json

b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b), ensure_ascii=False)))

then I get the length 10 in both cases. Yay :-D

– Mundron

1 Answer


Did you try compact encoding?

json.dumps(json.loads(line), separators=(',', ':'))

Also, you might want to disable the ASCII escaping if you really want to save space, but the resulting non-ASCII output might not be compatible with all JSON libraries, so use it with caution.

json.dumps(json.loads(line), separators=(',', ':'), ensure_ascii=False)


Example

import json

a = [[1, 2, 3], {'a':1, 'b':2, 'c':'ä'}]

print(json.dumps(a))
print(json.dumps(a, separators=(',', ':')))
print(json.dumps(a, separators=(',', ':'), ensure_ascii=False))

gives:

[[1, 2, 3], {"a": 1, "b": 2, "c": "\u00e4"}]
[[1,2,3],{"a":1,"b":2,"c":"\u00e4"}]
[[1,2,3],{"a":1,"b":2,"c":"ä"}]
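
One caveat when writing the non-escaped variant to a file (a minimal sketch; the file name is a placeholder): open the file with an explicit UTF-8 encoding, because the platform default encoding may not be UTF-8.

import json

line = '{"Ä": "ß"}'
# Write the compact, non-ASCII form back out. The explicit encoding
# matters: if the platform default is not UTF-8, consumers expecting
# UTF-8 may misread the file, or unmappable characters may raise errors.
with open('filtered.jsonl', 'w', encoding='utf-8') as f:
    f.write(json.dumps(json.loads(line),
                       separators=(',', ':'),
                       ensure_ascii=False) + '\n')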
– Finomnis
  • Oh nice. That works far better. On average, my reloaded and compactly dumped JSON objects are now only 1.7% instead of 9.5% bigger than the originals. But do you know why this happens? I thought JSON was standardized such that the string representation of a JSON object is always the same, but that doesn't seem to be the case. – Mundron Jun 12 '19 at 13:59
  • Should be ... maybe your previous files don't completely follow the standard – Finomnis Jun 12 '19 at 14:00
  • Further investigation won't be possible without having some json examples ... But it can't be that hard to find the differences, just walk through the input and output strings until you find a difference – Finomnis Jun 12 '19 at 14:03
  • I found that the difference appears at characters like Ä and ß. I included an example in my question. – Mundron Jun 13 '19 at 06:25
  • @Mundron: Add `ensure_ascii=False` to the parameters of `dumps`, and it will let Unicode characters like Ä and ß be kept as literals in the output. The default is `True`, which means they'll always be escaped. – Blckknght Jun 13 '19 at 06:37
  • Oh, thanks a lot. That is exactly what I was looking for! – Mundron Jun 13 '19 at 06:44