92

I have some JSON files of around 500 MB. If I use the "trivial" json.load() to load a file's content all at once, it will consume a lot of memory.

Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogous approach.

duduklein
  • The problem I am facing is that I have 195 files like that to process, and it seems that Python's garbage collector is not doing a good job. After the 10th file, I run out of memory. I'm using Python 2.6.4 on Windows 7 with 3 GB of RAM. – duduklein Mar 08 '10 at 11:13
  • Why do you need to load all of them into memory at once? That seems ineffective. – S.Lott Mar 08 '10 at 11:36
  • I don't have to load all of them at once, but it seems that the garbage collector is not working well. It consumes a lot of memory after many files are closed. When I iterate over the files, the json object always has the same variable name, and I assume that the garbage collector should free the memory that the other files occupied. But this just does not happen. – duduklein Mar 08 '10 at 20:10
  • @user210481: "assume that the garbage collector should free the memory" It should. Since it doesn't, something else is wrong. – S.Lott Mar 09 '10 at 03:08
  • @user210481: Show us your code! – John Machin Mar 09 '10 at 05:51
  • @vtd-xml-author does it make sense? – synhershko Mar 17 '11 at 08:11
  • The answer by @Jim Pivarski should be the accepted one. – 0 _ Feb 28 '16 at 03:58
  • http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python – Christophe Roussy Jul 05 '16 at 12:07

11 Answers

102

There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.

Update:

I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:

import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)

where prefix is a dot-separated index into the JSON tree (what happens if your key names contain dots? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
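
For a concrete feel of the event stream, here is roughly what a tiny document produces (this is a sketch; the exact numeric types can vary by ijson version and backend):

import io
import ijson

tiny = io.BytesIO(b'{"name": "nicip", "doses": [1, 2]}')
for prefix, the_type, value in ijson.parse(tiny):
    print((prefix, the_type, value))

# Roughly:
# ('', 'start_map', None)
# ('', 'map_key', 'name')
# ('name', 'string', 'nicip')
# ('', 'map_key', 'doses')
# ('doses', 'start_array', None)
# ('doses.item', 'number', 1)
# ('doses.item', 'number', 2)
# ('doses', 'end_array', None)
# ('', 'end_map', None)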

The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.

Jim Pivarski
  • I found this not only the best response to the question, but the most useful introduction to ijson I could find after much googling. Thank you for taking the time to go through the sparse documentation and presenting its basic functionality so simply and clearly. – prooffreader Apr 18 '14 at 09:44
  • Nice link. There is another ijson feature: a generator yielding dictionaries at a given place in the JSON data. Comparing execution time with other solutions, ijson is rather slow (57 s versus stdlib json), but it is excellent if you need to keep memory consumption low (13 MB versus 439 MB for stdlib json). Using the yajl2 backend, it was not faster, but memory consumption dropped to 5 MB. Tested on 3 files, each about 30 MB with 300 thousand records. – Jan Vlcinsky Oct 03 '14 at 21:47
  • For multiple top-level values (multi-JSON), use the `multiple_values=True` parameter, e.g. `for prefix, the_type, value in ijson.parse(open(json_file_name), multiple_values=True)`. – Mehrdad Salimi Jun 11 '23 at 15:08
  • If your file contains many small JSON documents separated by newlines (and has no unquoted newlines within each JSON document), then the standard JSON parser has performance advantages: `for jsondoc in open("newline-separated.jsons"): print(json.loads(jsondoc))`. – Jim Pivarski Jun 12 '23 at 14:09
18

So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:

  1. Modularize your code. Do something like:

    for json_file in list_of_files:
        process_file(json_file)
    

    If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

  2. Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (see the sketch after this list). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
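
A minimal sketch of the one-process-per-file idea, assuming Python 3; parse_one.py is a hypothetical script that loads and processes a single JSON file:

import subprocess
import sys

list_of_files = ["a.json", "b.json"]  # hypothetical list of your files

for json_file in list_of_files:
    # Each child parses exactly one file and then exits, so all memory it
    # used is returned to the OS before the next file is handled.
    subprocess.run([sys.executable, "parse_one.py", json_file], check=True)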

Hope this helps.

jcdyer
8

Yes.

You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here; check out the README for examples. It's fast because it uses the C yajl library.

keios
5

It can be done by using ijson. The workings of ijson are explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below:

[{"name": "rantidine",  "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip",  "drug": {"type": "capsule", "content_type": "solid"}}]

You can print every element of the array using the method below:

import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        # 'item' yields each element of the top-level array, one at a time
        for j in ijson.items(input_file, 'item'):
            print(j)

Note: 'item' is the prefix ijson uses for the elements of a top-level array.

If you want to access only specific JSON objects based on a condition, you can do it in the following way:

def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' yields the "drug" sub-object of each array element
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)

This will print only those drug objects whose type is tablet.

ak1234
3

Given your mention of running out of memory, I must ask whether you're actually managing memory. Are you using the del keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
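
A minimal sketch of that pattern, with a hypothetical file list and processing function:

import gc
import json

list_of_files = ["a.json", "b.json"]   # hypothetical

def process(obj):                      # hypothetical per-file processing
    pass

for path in list_of_files:
    with open(path) as f:
        obj = json.load(f)
    process(obj)
    del obj          # drop the only reference before loading the next file
    gc.collect()     # usually unnecessary; shown only to make the point explicit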

Aea
  • I'm not using the del command, since I thought it happened automatically because there were no more references to it. – duduklein Mar 08 '10 at 20:12
  • Since it wasn't removed, you still have references. Global variables are the usual problem. – S.Lott Mar 09 '10 at 03:09
2

Update

See the other answers for advice.

Original answer from 2010, now outdated

Short answer: no.

Properly dividing a json file would take intimate knowledge of the json object graph to get right.

However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.

For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.

You would have to do some string content parsing to get the chunking of the json file right.
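
For example, a minimal sketch of such a generator, assuming the file is a single top-level array and that strings may themselves contain brackets and commas (so string and escape state must be tracked); each completed element is handed to json.loads:

import json

def iter_array_elements(path, chunk_size=65536):
    """Yield each top-level element of a JSON array, parsed one at a time."""
    depth = 0            # nesting depth of {} / [] inside the outer array
    in_string = False
    escaped = False
    started = False      # has the outer array's '[' been seen yet?
    buf = []
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '{[':
                    depth += 1
                    buf.append(ch)
                elif ch == ']' and depth == 0:      # end of the outer array
                    last = ''.join(buf).strip()
                    if last:
                        yield json.loads(last)
                    return
                elif ch in '}]':
                    depth -= 1
                    buf.append(ch)
                elif ch == ',' and depth == 0:      # boundary between elements
                    yield json.loads(''.join(buf))
                    buf = []
                else:
                    buf.append(ch)

# Usage (hypothetical file name):
# for obj in iter_array_elements('big.json'):
#     process(obj)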

I don't know what generates your json content. If possible, I would consider generating a number of manageable files instead of one huge file.

codeape
  • Unfortunately, I can't post the file here, and it's not generated by me either. I was thinking about reading the json file with the regular json.load and generating a new text, line-delimited file to iterate over. The problem I am facing is that I have 195 files like that to process, and it seems that Python's garbage collector is not doing a good job. After the 10th file, I run out of memory. I'm using Python 2.6.4 on Windows 7. – duduklein Mar 08 '10 at 11:13
  • It would be cool if there was a SAX-like JSON api for Python. Like JACKSON for Java. – Warren P Feb 01 '11 at 20:56
  • It's unfortunate that this answer has been accepted since there are existing and working Python incremental json parsers... – bruno desthuilliers Apr 22 '20 at 07:01
  • I tried to delete the answer, but that doesn't work with accepted answers. Will edit. – codeape Apr 22 '20 at 10:51
  • @brunodesthuilliers do you have a suggestion of incremental parsing when the json is one huge string in `index` format? See my [question](https://stackoverflow.com/questions/61800463/read-large-json-file-with-index-format-into-pandas-dataframe). – MattSom May 14 '20 at 16:41
  • @duduklein can you change the accepted answer. – Phani Rithvij Jan 24 '21 at 10:50
2

Another idea is to try loading it into a document-store database like MongoDB. It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON, you can avoid it by loading the files one at a time.

If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.

http://www.mongodb.org/
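
A minimal sketch using the pymongo driver against a local MongoDB instance; the database, collection, file names, and query here are all hypothetical:

import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["mydb"]["big_json"]

# Load the files one at a time so only one parsed blob is in memory at once.
for path in ["file1.json", "file2.json"]:
    with open(path) as f:
        doc = json.load(f)              # still parses one whole file in memory
    if isinstance(doc, list):
        collection.insert_many(doc)     # one MongoDB document per array element
    else:
        collection.insert_one(doc)
    del doc

# Afterwards, query the data without holding everything in memory:
for record in collection.find({"drug.type": "tablet"}):
    print(record)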

George Godik
2

"the garbage collector should free the memory"

Correct.

Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.

Remove all global variables.

Make all module-level code into smaller functions.
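
A minimal sketch of that restructuring (all names are hypothetical); the point is that the parsed JSON only ever lives inside a function's local scope:

import json
import sys

def summarize(data):
    # hypothetical: keep only the small result you actually need
    return len(data)

def process_file(path):
    with open(path) as f:
        data = json.load(f)
    return summarize(data)   # `data` becomes unreachable when this returns

def main(paths):
    return [process_file(p) for p in paths]

if __name__ == "__main__":
    print(main(sys.argv[1:]))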

S.Lott
1

In addition to @codeape's answer:

I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Build a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
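
A minimal sketch of that exploration step, here using ijson (rather than a hand-written parser) so the whole blob never has to fit in memory; the file name is hypothetical:

import ijson

key_paths = set()
with open("huge.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "map_key":
            key_paths.add(f"{prefix}.{value}" if prefix else value)

# Print the discovered key hierarchy so you can decide how to chunk the file.
for path in sorted(key_paths):
    print(path)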

George Godik
0

You can convert the JSON file to a CSV file and then process it line by line:

import ijson
import csv


def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []

    with open(file_path, 'rb') as json_file, \
         open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)

        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)

Alon Barad
0

Simply using json.load() on the whole file will take a lot of time and memory. Instead, assuming the file has one JSON object per line, you can load the data line by line, build a dictionary of key/value pairs for each line, collect those into a final dictionary, and convert that to a pandas DataFrame, which will help you in further analysis.

import json

import pandas as pd


def get_data():
    # Yields the file one line at a time (assumes one JSON object per line).
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line


data_dict = {}

for i, line in enumerate(get_data()):
    each = {}
    # k and v are the key and value pair of the current line's JSON object
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# data_dict maps line number -> row dict, so the DataFrame comes out
# transposed (one column per line); transpose it to get one row per line.
Data = pd.DataFrame(data_dict)
Data_1 = Data.T