
I've got a 4GB JSON file with the following structure:

{
    "rows": [
        { "id": 1, "names": { "first": "john", "last": "smith" }, "dates": ... },
        { "id": 2, "names": { "first": "tim", "middle": ["james", "andrew"], "last": "wilson" }, "dates": ... }
    ]
}

I just want to iterate over all the rows and, for each row, extract the ID, the names, and a few other details, and write them out as a row in a CSV file.
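
The output step itself is easy enough; if I could just get each row as a Python dict, I'd write something like this (a minimal sketch - the exact field names and output path are made up for illustration):

import csv

with open('out.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['id', 'first', 'last'])
    for row in rows:  # 'rows' being whatever iterable of dicts I can get
        names = row.get('names', {})
        writer.writerow([row.get('id'), names.get('first'), names.get('last')])

The problem is getting that iterable in the first place.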

If I try to open the file in the standard way, it just hangs.
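
By "the standard way" I mean something like this - as far as I understand, json.load has to build the entire 4GB structure in memory, which is presumably why it hangs:

import json

with open('./myfile.json') as f:
    data = json.load(f)  # parses the whole document into memory at once

I've been trying to use ijson instead, as follows: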

import ijson

f = open('./myfile.json')
rows = ijson.items(f, 'rows')
for r in rows:
    print(r)

This works fine on a short extract of the file, but on the big file, it's hanging forever.

I've also tried this ijson method, which does seem to work on the big 4GB file:

import ijson

for prefix, the_type, value in ijson.parse(open('./myfile.json')):
    print(prefix, value)

But this seems to print every leaf node in turn, with no concept of each top-level row being a separate item, which quickly gets fiddly for JSON with an arbitrary number of leaf nodes per row. For example, to collect all the names, I'd need to do something like:

import ijson

names = []
name = {}
for prefix, the_type, value in ijson.parse(open('./myfile.json')):
    print(prefix, value)
    name[prefix] = value
    if ('rows.item.names.first' in name and
            'rows.item.names.last' in name and
            'rows.item.names.middle.item' in name):
        # This is the last of the leaf nodes, so we can add it to our list...
        # except... how to deal with the fact that middle may not
        # always be present?
        names.append(name)
        name = {}
Is there any way to iterate over each row (rather than each leaf) in turn, in such a large file?
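
One thing I've been wondering (but can't easily verify, since runs on the full file take so long) is whether I'm simply passing the wrong prefix to ijson.items, and whether something like this would stream each row as its own dict:

import ijson

with open('./myfile.json') as f:
    # 'rows.item' (rather than 'rows') should address each element of the
    # rows array individually, if I'm reading the ijson docs right
    for row in ijson.items(f, 'rows.item'):
        print(row['id'], row['names'])

Is that the right approach, or is there a better way to get row-by-row iteration?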

Richard