25

I have a multi-gigabyte JSON file. The file is made up of JSON objects that are no more than a few thousand characters each, but there are no line breaks between the records.

Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?

The data is in a plain text file. Here is an example of a similar record; the actual records contain many nested dictionaries and lists.

Record in readable format:

{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}

Actual format (new records start one after the other, without any breaks):

{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }
Cam
  • Post a sample of the data, at least a few *objects*. – Bibhas Debnath Feb 11 '14 at 17:02
  • You mean the JSON file is an array of objects, and you want to lazily read those objects? – poke Feb 11 '14 at 17:03
  • And did you already search for other posts on this very subject, here on Stack Overflow? There is at least one listed in the 'related' sidebar here that I can see. How did those posts not address your specific situation? – Martijn Pieters Feb 11 '14 at 17:11
  • @poke I'm not sure what you mean by 'lazily', but yes I think that is what I want. – Cam Feb 11 '14 at 17:26
  • @MartijnPieters None of the other posts I could find addressed the same problem. Could you share the link with the solution you found? – Cam Feb 11 '14 at 17:29
  • It sounds like you're looking for a [streaming JSON parser](https://stackoverflow.com/questions/444380/is-there-a-streaming-api-for-json) *for Python*, which I can't find a duplicate of on SO, so I think it's a legit question. – Michael Kropat Feb 11 '14 at 17:32
  • @user3281420: As it turns out, there is no answer that handles this specific case. But that was only apparent after you updated the post a little. :-) – Martijn Pieters Feb 11 '14 at 18:58

3 Answers

39

Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse data in chunks using the JSONDecoder.raw_decode() method.

The following will yield complete objects as the parser finds them:

from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break

This function reads from the given file object in chunks of buffersize characters and has the decoder object parse whole JSON objects out of the buffer. Each parsed object is yielded to the caller.

Use it like this:

with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        print(data)  # or otherwise process each decoded object

Use this only if your JSON objects are written to a file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case the approach from Loading and parsing a JSON file with multiple JSON objects in Python is a better fit; a minimal sketch of that case follows.
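
For the JSON Lines case, a sketch (assuming exactly one complete JSON object per line; jsonl_parse is an illustrative name, not part of the standard library) could look like this:

import json

def jsonl_parse(fileobj):
    # Each line holds one complete JSON document; blank lines are skipped.
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)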

Martijn Pieters
  • This worked great, thank you. Yes, the file I was dealing with had back-to-back JSON objects. Also, for the try/except, I used 'pass' instead of 'break'. Was the break intentional? I couldn't get it to work with it. – Cam Feb 11 '14 at 19:02
  • @user3281420: yes, the `break` was intentional; it breaks the `while` loop so we move on to the next chunk read from the file. The `break` is only triggered if there is no JSON object to decode in the current buffer. – Martijn Pieters Feb 11 '14 at 19:04
  • @user3281420: `pass` would only work if the buffer was empty as that is the other termination condition for the `while` loop. – Martijn Pieters Feb 11 '14 at 19:05
  • @user3281420: If you are willing to share a file that doesn't work when `break` is used, I'd love to see if I can debug it. Have a dropbox link for me perhaps? – Martijn Pieters Feb 11 '14 at 19:05
  • Hrm, perhaps I gave a bad example of the data. Sorry, I don't think I can share this data. When I was using a break, the script would exit before doing any of the processing on the JSON data. Thanks for the code though. I learned a lot just trying to figure out what it was doing. – Cam Feb 11 '14 at 19:24
  • A non-standard way is http://stackoverflow.com/questions/21855877/parse-2-json-strings-in-python/21856188 which, with some editing, might be useful to get rid of rubbish data between the JSON objects (newlines / any separators) – Rami Dabain Feb 18 '14 at 15:52
  • @RonanDejhero: it's easy enough to add a `.strip()` call to the `buffer`: `result, index = decoder.raw_decode(buffer.strip())` to remove whitespace, `buffer.strip(' \n|')` to remove an explicit set of characters. – Martijn Pieters Feb 18 '14 at 15:54
  • How do you extract a key-value from this 'data' variable? – user2441441 Mar 09 '15 at 19:15
  • @user2441441: `data` is the JSON data decoded to a Python object. It depends on what JSON object you decoded how to get key-value pairs from it. – Martijn Pieters Mar 09 '15 at 19:16
  • @MartijnPieters' comment suggesting `decoder.raw_decode(buffer.strip())` will produce an incorrect index for `buffer = buffer[index:]`. You need `buffer = buffer.lstrip()` before calling `raw_decode`. – Willem Dec 21 '20 at 14:03
  • @Willem: ah, indeed, the index would be into the stripped buffer, not the original. – Martijn Pieters Dec 24 '20 at 16:20
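
Putting Willem's correction together, a sketch of the generator with the stripping done before decoding, so the returned index stays valid (a variant for illustration, not the answer's code verbatim):

from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            # Strip first, so raw_decode's index refers to the same
            # string that is sliced below.
            buffer = buffer.lstrip()
            try:
                result, index = decoder.raw_decode(buffer)
            except ValueError:
                # Not enough data to decode, read more
                break
            yield result
            buffer = buffer[index:]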
7

Here is a slight modification of Martijn Pieters' solution, which will handle JSON strings separated by whitespace.

import json
import functools


def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                stripped = remainder.strip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break

For example, if data.txt contains JSON strings separated by a space:

{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}

then

In [47]: list(json_parse(open('data.txt')))
Out[47]: 
[{'Accepts Credit Cards': True,
  'Price Range': 1,
  'business_id': '1',
  'type': 'food'},
 {'Accepts Credit Cards': True,
  'Price Range': 2,
  'business_id': '2',
  'type': 'cloth'},
 {'Accepts Credit Cards': False,
  'Price Range': 3,
  'business_id': '3',
  'type': 'sports'}]
unutbu
6

If your JSON document contains a list of objects, and you want to read them one at a time, you can use the iterative JSON parser ijson for the job. It will only read more content from the file when it needs to decode the next object.

Note that you should use it with the YAJL library, otherwise you will likely not see any performance increase.

That being said, unless your file is really big, reading it completely into memory and then parsing it with the normal JSON module will probably still be the best option.
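
A minimal sketch of the ijson approach, assuming the file holds a single top-level JSON array (the 'item' prefix tells ijson to yield each element of that array; 'yourfilename' is a placeholder):

import ijson

with open('yourfilename', 'rb') as infh:
    # ijson streams elements of the top-level array one at a time,
    # reading from the file only as needed.
    for obj in ijson.items(infh, 'item'):
        print(obj)  # process each object here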

poke