
I have a bunch of JSON objects that I need to compress as they're eating too much disk space, approximately 20 GB worth for a few million of them.

Ideally what I'd like to do is compress each one individually and then, when I need to read them, just iteratively load and decompress each one. I tried doing this by creating a text file with each line being a JSON object compressed via zlib, but this fails with a

decompress error due to a truncated stream,

which I believe is due to the compressed strings containing newlines.
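
Roughly what I tried, as a simplified sketch (not my exact code; the filename is made up):

import json
import zlib

# simplified sketch: zlib output is binary and can itself contain b"\n" bytes,
# so reading the file back line by line hands zlib.decompress() cut-off data
objects = [{"id": i, "text": "some payload %d" % i} for i in range(1000)]

with open("objects.zjson", "wb") as f:  # made-up filename
    for obj in objects:
        f.write(zlib.compress(json.dumps(obj).encode("utf-8")) + b"\n")

with open("objects.zjson", "rb") as f:
    for line in f:
        # raises zlib.error ("incomplete or truncated stream") whenever a
        # compressed blob happened to contain a newline byte
        obj = json.loads(zlib.decompress(line.rstrip(b"\n")))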

Anyone know of a good method to do this?

asked by Newmu

2 Answers


Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.

The object takes care of compression transparently, and will buffer reads, decompressing chunks as needed.

import gzip
import json

# writing
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        outfile.write(json.dumps(obj) + '\n')

# reading
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj

This has the added advantage that the compression algorithm can exploit repetition across objects, improving the compression ratio.
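
For a rough illustration of that effect (made-up sample data; the exact numbers depend entirely on your objects), compare per-object zlib compression with one shared gzip stream:

import gzip
import io
import json
import zlib

# made-up sample data, purely to illustrate cross-object repetition
objects = [{"id": i, "name": "user-%d" % i, "active": True} for i in range(10000)]
lines = [json.dumps(obj).encode("utf-8") for obj in objects]

# compressing every object on its own (roughly what the question tried)
per_object = sum(len(zlib.compress(line)) for line in lines)

# one shared gzip stream over all the objects (this answer's approach)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="w") as f:
    for line in lines:
        f.write(line + b"\n")
single_stream = len(buf.getvalue())

print("per object: %d bytes, single stream: %d bytes" % (per_object, single_stream))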

answered by Martijn Pieters
  • Or `bz2.BZ2File` since 2.3, or `lzma.LZMAFile` since 3.3. – Steve Jessop Dec 08 '13 at 03:59
  • Worked great! Exactly what I needed. – Newmu Dec 08 '13 at 04:00
  • If the application that's going to consume these files is expecting JSON, though, this won't work without massaging of the sort Martijn provides with his reading method. I.e. an application expecting raw JSON will be disappointed. – duhaime Oct 21 '15 at 23:54
  • @duhaime: of course! Producing compressed JSON is an exception, not a usual case. When exchanging JSON over HTTP, the HTTP server may still apply content compression transparently, but you'd leave that to your HTTP library to handle (`requests` does this for you, for example). – Martijn Pieters Oct 22 '15 at 08:26
  • In Python 3 you may have to convert the JSON string to bytes (like `json_str.encode()`), as the gzip.GzipFile handler expects to write a bytes-like object, not 'str'. – Yibo Yang Jun 28 '17 at 22:03
  • @YiboYang or just wrap the `outfile` object in an [`io.TextIOWrapper` instance](https://docs.python.org/3/library/io.html#io.TextIOWrapper) (see the sketch below). – Martijn Pieters Jun 28 '17 at 23:28
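
A minimal sketch of the `io.TextIOWrapper` approach from the last comment, for Python 3 (the filename and sample data are placeholders; `gzip.open(..., 'wt')` is another option on Python 3):

import gzip
import io
import json

objects = [{"id": 1}, {"id": 2}]   # placeholder data
jsonfilename = "objects.jsonl.gz"  # placeholder name

# writing: GzipFile opened in binary mode wants bytes, so add a text layer on top
with gzip.GzipFile(jsonfilename, "w") as outfile:
    with io.TextIOWrapper(outfile, encoding="utf-8") as text_out:
        for obj in objects:
            text_out.write(json.dumps(obj) + "\n")

# reading: the same wrapper turns the decompressed bytes back into str lines
with gzip.GzipFile(jsonfilename, "r") as infile:
    with io.TextIOWrapper(infile, encoding="utf-8") as text_in:
        for line in text_in:
            obj = json.loads(line)
            # process obj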

You might want to try an incremental JSON parser, such as jsaone.

That is, create a single JSON document with all your objects, and parse it like:

import gzip

import jsaone

with gzip.GzipFile(file_path, 'r') as f_in:
    for key, val in jsaone.load(f_in):
        ...

This is quite similar to Martijn's answer; it wastes slightly more space but is maybe slightly more comfortable.
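
For the writing side, a minimal sketch of producing that single JSON document (the path and sample data are placeholders; this assumes Python 3's text-mode `gzip.open`):

import gzip
import json

objects = [{"id": 1}, {"id": 2}]  # placeholder data
file_path = "objects.json.gz"     # placeholder path

# one big JSON document (here a single array) written straight into a gzip file;
# gzip.open in text mode ('wt') is available on Python 3
with gzip.open(file_path, "wt", encoding="utf-8") as f_out:
    json.dump(objects, f_out)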

EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.

answered by Pietro Battiston