
So, here is my JSON file. I want to load the `data` lists from it, one by one, and only those. And then, for example, plot them...

This is just an example; I am dealing with a large data set, and I cannot load the whole file at once (that would create a memory error).

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city"},
      {"name": "Thames", "type": "river"}, 
      {"par": 2, "data": [1,7,4,7,5,7,7,6]}, 
      {"par": 2, "data": [1,0,4,1,5,1,1,1]}, 
      {"par": 2, "data": [1,0,0,0,5,0,0,0]}
        ],
    "america": [
      {"name": "Texas", "type": "state"}
    ]
  }
}

Here is what I tried:

import ijson
filename = "testfile.json"

f = open(filename)
mylist = ijson.items(f, 'earth.europe[2].data.item')
print(mylist)

It returns nothing, even when I try to convert it into a list:

[]
Agape Gal'lo

3 Answers


You need to specify a valid prefix; ijson prefixes are either keys in a dictionary or the word item for list entries. You can't select a specific list item (so [2] doesn't work).

If you wanted the `data` lists from all the dictionaries in the europe list, the prefix would be:

earth.europe.item.data
# ^ ------------------- outermost key must be 'earth'
#       ^ ------------- next key must be 'europe'
#              ^ ------ any value in the array
#                   ^   the value for the 'data' key

This produces each such list:

>>> l = ijson.items(f, 'earth.europe.item.data')
>>> for data in l:
...     print(data)
...
[1, 7, 4, 7, 5, 7, 7, 6]
[1, 0, 4, 1, 5, 1, 1, 1]
[1, 0, 0, 0, 5, 0, 0, 0]

You can't put wildcards in that, so you can't use `earth.*.item.data`, for example.

If you need to do more complex prefix matching, you'd have to use the ijson.parse() function and handle the events it produces. You can reuse the ijson.ObjectBuilder() class to turn the events you are interested in into Python objects:

parser = ijson.parse(f)
for prefix, event, value in parser:
    if event != 'start_array':
        continue
    if prefix.startswith('earth.') and prefix.endswith('.item.data'):
        # e.g. prefix == 'earth.europe.item.data'; the middle part names the continent
        continent = prefix.split('.', 2)[1]
        builder = ijson.ObjectBuilder()
        builder.event(event, value)
        # feed events into the builder until this same array is closed
        for nprefix, event, value in parser:
            if (nprefix, event) == (prefix, 'end_array'):
                break
            builder.event(event, value)
        data = builder.value
        print(continent, data)

This prints every array stored under a 'data' key of a list entry inside the top-level 'earth' key (so anything whose prefix ends with '.item.data'), and extracts the continent name along the way.
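As the comments below also suggest, you can wrap the same event handling in a generator, so each (continent, data) pair is produced, processed and discarded one at a time and memory use stays flat. A minimal sketch, assuming the example file above is saved as testfile.json:

import ijson

def iter_data(f):
    # yield (continent, data) pairs one at a time from an open JSON file
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if event != 'start_array':
            continue
        if prefix.startswith('earth.') and prefix.endswith('.item.data'):
            continent = prefix.split('.', 2)[1]
            builder = ijson.ObjectBuilder()
            builder.event(event, value)
            for nprefix, event, value in parser:
                if (nprefix, event) == (prefix, 'end_array'):
                    break
                builder.event(event, value)
            yield continent, builder.value

with open('testfile.json') as f:
    for continent, data in iter_data(f):
        print(continent, data)  # handle one list here, then let it be freed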

Martijn Pieters
  • Thanks a lot! Even if I will have to concentrate a bit on the second part, that's the best explanation I have found on the internet :)) – Agape Gal'lo Nov 03 '16 at 11:03
  • I think you have answered this question, but is there any way to load those data lists one by one? Because if I want to process them, I have to store them (for example) in a list, and the same problem happens again: a "memory error". Any idea? – Agape Gal'lo Nov 03 '16 at 15:10
  • @JeanneDiderot: yes, where I use `print` right now, you can treat just that one list, then discard it. Or you could wrap the whole thing in to a function, use `yield continent, data` to have it produce each `data` list one by one as you iterate, and again if you then don't add more references to the list it'll be cleared again. – Martijn Pieters Nov 03 '16 at 15:16
  • Ok, it works! But very slowly... I guess that's because of the file's size. But one thing is strange: if I want to load ONE element (for example Paris), it is still very, very slow (as slow as for a long array). And more generally, even if your explanations are good, ijson doesn't seem very fast... – Agape Gal'lo Nov 03 '16 at 18:48
  • @JeanneDiderot: the default backend for `ijson` is the pure-python parser, which is slow. Install YAJL 2.x and use `import ijson.backends.yajl2_cffi as ijson` to import a much, much faster backend (see the backend-selection sketch after these comments). – Martijn Pieters Nov 03 '16 at 18:50
  • @JeanneDiderot: see http://lloyd.github.io/yajl/; different platforms may already have an installable package available. I used `brew install yajl` on my Mac. – Martijn Pieters Nov 03 '16 at 18:51
  • I have Windows 10. Do I have to install Git, and then enter `$ git clone git://github.com/lloyd/yajl`? – Agape Gal'lo Nov 03 '16 at 19:02
  • @JeanneDiderot: there is a ready Windows binary here: https://github.com/LearningRegistry/LearningRegistry/wiki/Windows-Installation-Guide#yajl. No idea what version that is. There may be others. – Martijn Pieters Nov 03 '16 at 19:03
  • @JeanneDiderot: do experiment with the different [documented backends](https://github.com/isagalaev/ijson/#backends); if you can only get yajl 1.x, then that'll still be faster than the pure-Python version. If you can get 2.x, *and* you can install the [`cffi` package](http://cffi.readthedocs.io/en/latest/installation.html), then you get the fastest option of all, however. – Martijn Pieters Nov 03 '16 at 19:12
  • You will get tired of me... I downloaded "yajl-2.1.0.zip" but I don't succeed in installing it... and `brew install yajl` is not recognized as an internal command. :(( BUT I did succeed in installing cffi!! – Agape Gal'lo Nov 03 '16 at 19:21
  • @JeanneDiderot: perhaps you need to start asking a question on [su] then. `brew` wouldn't work, that's a Mac OS X tool. You'll either need to compile the project (no need to install git, there are [download links](https://lloyd.github.io/yajl/), but you *would* need Visual Studio), or you need to find a compiled version (like the zip file) and install that in the right location. What the right location is, I don't know, I don't use Windows, sorry. – Martijn Pieters Nov 03 '16 at 19:23
  • Ok, I [did](http://superuser.com/questions/1142121/install-yajl-2-x-on-windows-10). Any idea of another way to read and process the data faster? – Agape Gal'lo Nov 03 '16 at 19:47
  • @AgapeGal'lo: sorry, I'm not aware of other options than to break up your data set into smaller JSON files by some other means or to use a streaming parser, for which on Python all libraries that support this use `yajl`. – Martijn Pieters Nov 03 '16 at 19:59
  • I tested my program again. The difficulty is maybe not where I thought. That's very strange (or maybe interesting...). The organisation is much the same as in the example. There are about 800 points in each "data", but loading the "2" of the dictionary "par" takes 130 times longer!! I use this code: `object = ijson.items(f, 'earth.europe.par')`, then `for i in object: speed = np.float(list(object)[0])` (as there is only one element, it works). But the bigger the file, the longer (in an unreasonable way...) it takes to extract this single float! – Agape Gal'lo Nov 03 '16 at 22:09
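Following up on the backend discussion in the comments above: a small sketch that picks the fastest ijson backend available and falls back to the pure-Python parser. The module paths are the ones documented for ijson 2.x; every backend exposes the same items()/parse() API:

try:
    # yajl 2.x via cffi: the fastest documented backend
    import ijson.backends.yajl2_cffi as ijson
except ImportError:
    try:
        # yajl 2.x via ctypes
        import ijson.backends.yajl2 as ijson
    except ImportError:
        try:
            # yajl 1.x: still faster than pure Python
            import ijson.backends.yajl as ijson
        except ImportError:
            import ijson  # pure-Python fallback, the slowest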

Given the structure of your JSON, I would do this:

import json

filename = "test.json"

with open(filename) as data_file:
    data = json.load(data_file)

print(data['earth']['europe'][2]['data'])
print(type(data['earth']['europe'][2]['data']))
ASMateus
  • No, I want to load only the data lists from the JSON file, not the whole file. The problem is that I have a 500 MB file, and Python gives me a "memory error" when I try to load everything. – Agape Gal'lo Nov 02 '16 at 13:41

So, I will explain how I finally solved this problem. The first answer works, but you have to know that loading elements one by one with ijson is very slow... and at the end, you still do not have the whole file loaded.

The important information is that Windows limits the memory of each process to 2 or 4 GB, depending on which Windows you use (32- or 64-bit). If you use Python(x,y), that will be 2 GB (it only exists in 32-bit). Either way, that's very, very low!
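If you are unsure whether your Python itself is a 32- or 64-bit process, a quick check (it only inspects the interpreter's pointer size):

import struct
import sys

# 32 on a 32-bit build (at most 2-4 GB of address space), 64 on a 64-bit build
print(struct.calcsize('P') * 8, 'bit Python')
print('sys.maxsize:', sys.maxsize)  # 2**31 - 1 on 32-bit builds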

I solved this problem by installing a Linux virtual machine inside my Windows, and it works. Here are the main steps:

  1. Install VirtualBox
  2. Install Ubuntu (for example)
  3. Install a scientific Python stack on it, e.g. with SciPy
  4. Create a shared folder between the two "computers" (you will find tutorials on Google)
  5. Execute your code on your Ubuntu "computer": it should work ;)

NB: Do not forget to allocate sufficient RAM and disk space to your virtual machine.

This works for me; I no longer have this "memory error" problem.

Agape Gal'lo
  • No, not until you have a larger JSON file still. Streamed parsing is still the better option. If you were willing to install a virtual machine with Linux for this, why not also try to use ijson with yajl as the backend? – Martijn Pieters Nov 08 '16 at 15:32
  • Because the BIG advantage of this method is that at the end you have the file loaded. And often, when you do data processing, you want to modify the parameters of the analysis, and that goes much faster if the file is already loaded. Then, **if the file is really too big** (more than a few GB), I would clearly recommend your method. But my files are "only" 1-2 GB... And I think that's the case for many people who ask this question. – Agape Gal'lo Nov 08 '16 at 17:57
  • At any rate, this isn't really an answer to your question posted *here*, which appeared to concern the use of the ijson library. You are answering the question 'how to load a large JSON file', which is a problem that may have *led* to the actual question posted. :-) – Martijn Pieters Nov 08 '16 at 17:59
  • 1
    You clearly right! I re-put your answer as the best one ;) That's clearly the most complete! – Agape Gal'lo Nov 08 '16 at 17:59
  • Thanks, that's much appreciated. Not just for me, but also for future visitors that *probably* come here to see how to use ijson specifically. :-) – Martijn Pieters Nov 08 '16 at 18:00