
Why is the length of json.dumps(json.loads(line)) generally greater than the length of line?

I have a huge number of JSON objects and tried to sort out the bad ones (too few details). So I read the JSON objects from a .jsonl file and saved the good ones (with a lot of details) to a new file. I sorted out around 60% of the JSON objects, but the new file was only around 6% smaller. I found that strange and ran a test: I compared the length of each JSON object line with the length of json.dumps(json.loads(line)). The lengths of json.dumps(json.loads(line)) ranged from 83% to 121% of the length of line; on average, json.dumps(json.loads(line)) was 109.5% of the length of line.

Why does this happen, and how can I prevent it? How can I write a filtered copy of the file with Python without the remaining objects growing by about 10%? My filtering loop looks roughly like the sketch below.
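
For context, a minimal sketch of what I do (has_enough_details is a placeholder for my actual quality check):

import json

# Read the .jsonl file line by line, keep only the detailed objects,
# and write them back out. The json.dumps() call is where the output
# grows compared to the original lines.
with open('objects.jsonl', encoding='utf-8') as src, \
        open('good_objects.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        obj = json.loads(line)
        if has_enough_details(obj):  # hypothetical filter function
            dst.write(json.dumps(obj) + '\n')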

I found an example:

import json

b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b))))
print(len(json.dumps(json.loads(b), separators=(',', ':'))))

The output is 10, 20 and 19. The compact encoding only saves the single space, but the dump of the loaded object is twice as long as the original. When I print json.dumps(json.loads(b)) I get

{"\u00c4": "\u00df"}

It seems that json.dumps() does not encode characters like Ä and ß in a space-saving way. I could write my own dumps function with a more compact encoding, but I would like to avoid that.
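
As far as I understand, json.dumps() escapes every non-ASCII character to a six-character \uXXXX sequence by default, which is where the blow-up comes from:

import json

# Each non-ASCII character becomes a six-character \uXXXX escape.
print(json.dumps('Ä'))       # "\u00c4"
print(len(json.dumps('Ä')))  # 8: two quotes plus the 6-character escape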

I just found the Stack Overflow question Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence (and Finomnis's comment on the answer).

If I try

import json

b = r'{"Ä": "ß"}'
print(len(b))
print(len(json.dumps(json.loads(b), ensure_ascii=False)))

then I get the length 10 in both cases. Yay :-D

– Mundron

1 Answer


Did you try compact encoding?

json.dumps(json.loads(line), separators=(',', ':'))

Also, you might want to disable the ASCII escaping if you really want to save space, but the resulting non-ASCII output might not be compatible with all JSON libraries, so use it with caution.

json.dumps(json.loads(line), separators=(',', ':'), ensure_ascii=False)


Example

import json

a = [[1, 2, 3], {'a':1, 'b':2, 'c':'ä'}]

print(json.dumps(a))
print(json.dumps(a, separators=(',', ':')))
print(json.dumps(a, separators=(',', ':'), ensure_ascii=False))

gives:

[[1, 2, 3], {"a": 1, "b": 2, "c": "\u00e4"}]
[[1,2,3],{"a":1,"b":2,"c":"\u00e4"}]
[[1,2,3],{"a":1,"b":2,"c":"ä"}]
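
One caveat when writing the non-escaped variant to a file (a minimal sketch; the file name is a placeholder): open the file with an explicit UTF-8 encoding, because the platform default encoding may not be UTF-8.

import json

line = '{"Ä": "ß"}'
# Write the compact, non-ASCII form back out. The explicit encoding
# matters: if the platform default is not UTF-8, consumers expecting
# UTF-8 may misread the file, or unmappable characters may raise errors.
with open('filtered.jsonl', 'w', encoding='utf-8') as f:
    f.write(json.dumps(json.loads(line),
                       separators=(',', ':'),
                       ensure_ascii=False) + '\n')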
– Finomnis
  • Oh nice. That works far better. On average, my reloaded and compactly dumped JSON objects are now only 1.7% instead of 9.5% bigger than the originals. But do you know why this happens? I thought JSON was standardized such that the string representation of a JSON object is always the same, but that doesn't seem to be the case. – Mundron Jun 12 '19 at 13:59
  • Should be ... maybe your previous files don't completely follow the standard – Finomnis Jun 12 '19 at 14:00
  • Further investigation won't be possible without having some json examples ... But it can't be that hard to find the differences, just walk through the input and output strings until you find a difference – Finomnis Jun 12 '19 at 14:03
  • I found that the difference appears at characters like Ä and ß. I included an example in my question. – Mundron Jun 13 '19 at 06:25
  • @Mundron: Add `ensure_ascii=False` to the parameters of `dumps`, and it will let Unicode characters like Ä and ß be kept as literals in the output. The default is `True`, which means they'll always be escaped. – Blckknght Jun 13 '19 at 06:37
  • Oh, thanks a lot. That is exactly what I was looking for! – Mundron Jun 13 '19 at 06:44