
For Python 3, I followed @Martijn Pieters's code, like this:

import gzip
import json

# writing
with gzip.GzipFile(jsonfilename, 'w') as fout:
    for i in range(N):
        uid = "whatever%i" % i
        dv = [1, 2, 3]
        data = json.dumps({
            'what': uid,
            'where': dv})

        fout.write(data + '\n')

but this results in an error:

Traceback (most recent call last):
    ...
  File "C:\Users\Think\my_json.py", line 118, in write_json
    fout.write(data + '\n')
  File "C:\Users\Think\Anaconda3\lib\gzip.py", line 258, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

Any thoughts about what is going on?

Henry Thornton

4 Answers


You have four steps of transformation here.

  1. a Python data structure (nested dicts, lists, strings, numbers, booleans)
  2. a Python string containing a serialized representation of that data structure ("JSON")
  3. a bytes object containing a representation of that string ("UTF-8")
  4. a bytes object containing a shorter representation of the previous one ("gzip")

So let's take these steps one by one.

import gzip
import json

# jsonfilename and N are assumed to be defined, as in the question
data = []
for i in range(N):
    uid = "whatever%i" % i
    dv = [1, 2, 3]
    data.append({
        'what': uid,
        'where': dv
    })                                           # 1. data

json_str = json.dumps(data) + "\n"               # 2. string (i.e. JSON)
json_bytes = json_str.encode('utf-8')            # 3. bytes (i.e. UTF-8)

with gzip.open(jsonfilename, 'w') as fout:       # 4. fewer bytes (i.e. gzip)
    fout.write(json_bytes)                       

Note that adding "\n" is completely superfluous here; it does not break anything, but it serves no purpose either. I've kept it only because your code sample has it.

Reading works exactly the other way around:

with gzip.open(jsonfilename, 'r') as fin:        # 4. gzip
    json_bytes = fin.read()                      # 3. bytes (i.e. UTF-8)

json_str = json_bytes.decode('utf-8')            # 2. string (i.e. JSON)
data = json.loads(json_str)                      # 1. data

print(data)

Of course the steps can be combined:

with gzip.open(jsonfilename, 'w') as fout:
    fout.write(json.dumps(data).encode('utf-8'))                       

and

with gzip.open(jsonfilename, 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))
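
For a quick in-memory illustration of the same four steps, here is a minimal sketch (reusing the `gzip` and `json` imports from above) that uses `gzip.compress` and `gzip.decompress` instead of a file:

obj = {"what": "whatever0", "where": [1, 2, 3]}    # 1. data
json_str = json.dumps(obj)                         # 2. string (i.e. JSON)
json_bytes = json_str.encode('utf-8')              # 3. bytes (i.e. UTF-8)
gz_bytes = gzip.compress(json_bytes)               # 4. fewer bytes (i.e. gzip)

# and back again, reversing each step
assert json.loads(gzip.decompress(gz_bytes).decode('utf-8')) == obj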
Tomalak
    What a terrific and thoughtful answer. It works now. Thanks! – Henry Thornton Sep 12 '16 at 13:31
  • Ooops. Sorry. Upvoted! I'm guessing that most people (including me) think that accepting an answer delivers an automatic upvote. SO should probably be set up that way. ps. Don't take it personally. – Henry Thornton Sep 12 '16 at 13:53
  • I'm not taking it personally. I am just scratching my head every time. Upvote+no accept: "This was helpful, another answer was a bit better" - Upvote+accept "This was helpful, it solved my issue." - No upvote+accept "this was not really what I was looking for, but I'll use it for lack of alternatives" - No Upvote+accept+enthusiastic comment: [does not compute]. :D – Tomalak Sep 12 '16 at 13:57
  • I've been using SO for many years and I'm guilty of accepting answers but not upvoting. I've learned the lesson today but I'm not sure it solves the problem for the universe of users. – Henry Thornton Sep 12 '16 at 14:00
  • Well, it's just my interpretation of the site's mechanics. The upvote and accept functions are separate for a reason, so there are multiple ways of expressing how you think about an answer. "no upvote+accept" is a completely valid choice. But not upvoting something you find helpful (no matter if it's your thread or somebody else's, question or answer) makes no sense - to me, at least. – Tomalak Sep 12 '16 at 14:05
  • Maybe when an answer is accepted, SO should prompt for an upvote. Anyway, this is becoming an extended discussion. Again, many thanks. – Henry Thornton Sep 12 '16 at 14:08
  • @Tomalak is it compulsory to convert the JSON into a string and then into bytes for compression? Should the conversion time be counted when comparing two different compression methods? I might be missing basics of compression but still wanted to know the answer. –  Jun 01 '17 at 14:25
  • @ranadan You can compress any stream of bytes. How you end up with a stream of bytes is entirely up to you. Converting a Python data structure to JSON (*serializing* it as JSON) is one way to make it into a stream of bytes. There might be other serializers, JSON just happens to be an extremely common one. [pickle](https://docs.python.org/2/library/pickle.html) is a Python-specific serializer that turns Python objects into a stream of bytes. If you don't intend to share data across different environments, a platform-specific serializer might be the better choice. – Tomalak Jun 01 '17 at 16:00
  • @Tomalak So when I am comparing two different compression methods with respect to time, should I count the time it takes to convert the JSON into bytes, since it is part of compression? –  Jun 01 '17 at 17:37
  • @Tomalak see this: [URL](https://stackoverflow.com/questions/44306084/comprising-different-compressing-methods-for-json-data-in-python3). For me, I am using JSON only and trying to compare different compression methods with respect to time and space. –  Jun 01 '17 at 17:39
  • That's a rather subjective question. If you see "getting an object from memory to a compressed file" as a single operation, then yes. In effect you are not really trying to measure the performance of the zip algorithm, or are you? – Tomalak Jun 01 '17 at 17:42
  • @Tomalak yes, I am trying to measure the performance of the zip algorithm. I have JSON data which I have to send to the server; I usually gzip it before sending. But I want to check actual performance before choosing a compression algorithm. One more thing: for converting an existing JSON file to bytes, is there a better way than this? `with open('data.json','r') as fid_json: json_dict = json.load(fid_json); json_str = str(json_dict)` –  Jun 02 '17 at 04:16
  • When you already have JSON, why would you want to parse and serialize it again? (Rhetorical question, don't answer it. Think about it instead.) – Tomalak Jun 02 '17 at 06:17
  • This has an even more compact approach that lets gzip handle the encoding and lets json use the stream from gzip directly: https://stackoverflow.com/a/49535758/1236083 – Rafe Mar 07 '19 at 18:19
  • Thanks for the cross-link @Rafe – Tomalak Mar 07 '19 at 18:45
  • We can simplify the final example further with `json.load` instead of `loads`, as just `data = json.load(fin)` – patricksurry Jun 20 '19 at 14:01
  • True, `json.load` can (as of Python 3.6) deal with byte input. But it lacks a way of specifying the encoding and assumes a default instead. Since explicit is better than implicit, I prefer specifying the encodings myself. (Also I wanted to keep it equivalent to the longer sample above.) – Tomalak Jun 20 '19 at 15:46
  • I am making some tests. `"\n"` does something if `data` is a dictionary with many dictionaries in it. In particular, it allows reading the file back line by line. Is that possible? – GRquanti Oct 15 '19 at 18:53
  • There is no such thing as line-wise reading of JSON. You can have a file format that has one JSON-encoded entity per line, but that file *itself* is not JSON anymore. – Tomalak Oct 16 '19 at 12:40
  • @greenie-beans Whatever `json.dumps` does is not related to gzip at all. I would suggest making a short code sample (only the `json` part of the code, nothing else) which reproduces the error you see, and asking that as a stand-alone question. – Tomalak Jun 25 '20 at 15:25
  • @greenie-beans Never mind, you got it solved, that's the important part. Comments can be deleted. :) – Tomalak Jun 25 '20 at 16:05
  • Just curious: Why did you use the `GzipFile` class directly instead of using [the `open` function](https://docs.python.org/3/library/gzip.html#gzip.open)? – Harm Nov 07 '20 at 10:35
  • @Harm Hm... I guess it didn't occur to me at the time. :) – Tomalak Nov 07 '20 at 13:01
  • @Tomalak No worries, just thought it looks cleaner, mirroring [the regular `open` function](https://docs.python.org/3/library/functions.html#open) – Harm Nov 07 '20 at 13:51

The solution mentioned here (thanks, @Rafe) has a big advantage: since encoding is done on the fly, you don't create two complete intermediate objects (the JSON string and its UTF-8 bytes) for the generated JSON. With big objects, this saves memory.

with gzip.open(jsonfilename, 'wt', encoding='UTF-8') as zipfile:
    json.dump(data, zipfile)

In addition, reading and decoding is simple as well:

with gzip.open(jsonfilename, 'rt', encoding='UTF-8') as zipfile:
    my_object = json.load(zipfile)
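
Both `gzip.open` calls also accept a `compresslevel` argument (0 to 9, default 9) if you want to trade compression ratio for speed, e.g.:

with gzip.open(jsonfilename, 'wt', encoding='UTF-8', compresslevel=6) as zipfile:
    json.dump(data, zipfile)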
JanFrederik

Explosion (the makers of spaCy) maintains a package called `srsly` for serializing different file types. They don't advertise it much, but it's really useful.

Writing to gzip:

import srsly

data = {"foo": "bar", "baz": 123}
srsly.write_gzip_json("/path/to/file.json.gz", data)

Reading from gzip:

data = srsly.read_gzip_json("/path/to/file.json.gz")
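
A quick round-trip check with the two calls above (path as before):

assert srsly.read_gzip_json("/path/to/file.json.gz") == data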

You can read more on their pypi page: https://pypi.org/project/srsly/

Liam Roberts

To write to a `.json.gz` file, you can use the following snippet:

import json
import gzip

with gzip.open("file_path_to_write", "wt") as f:
        json.dump(expected_dict, f)

And to read from a `.json.gz` file, you can use the following snippet:

import json
import gzip

with gzip.open("file_path_to_read", "rt") as f:
        expected_dict = json.load(f)
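
Putting the two snippets together, a round trip looks like this (file name and dict are placeholders):

import gzip
import json

expected_dict = {"foo": "bar", "baz": 123}

with gzip.open("file.json.gz", "wt") as f:
    json.dump(expected_dict, f)

with gzip.open("file.json.gz", "rt") as f:
    assert json.load(f) == expected_dict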
Amir nazary