
I'd like to know if there is a memory-efficient way of reading a multi-record JSON file (each line is a JSON dict) into a pandas DataFrame. Below is a 2-line example with a working solution; I need it for a potentially very large number of records. An example use would be to process output from Hadoop Pig's JsonStorage function.

import json
import pandas as pd

test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []

l = [json.loads(line) for line in test.splitlines()]
df = pd.DataFrame(l)
  • Use the chunksize attribute of pd.read_json to get a list of dataframes and use map or loop to iterate over the dataframes – devssh Jun 07 '18 at 05:34

4 Answers


Note: line-delimited JSON is now supported in read_json (since 0.19.0):

In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
   a  b
0  1  2
1  3  4

or with a file/filepath rather than a json string:

pd.read_json(json_file, lines=True)

Which is faster is going to depend on the size of your DataFrame, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:

In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'

For this tiny example it's slower; at around 100 rows the two are similar, and there are significant gains if it's larger...

In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop

In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop

In [23]: test_100 = '\n'.join([test] * 100)

In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop

In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop

In [26]: test_1000 = '\n'.join([test] * 1000)

In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop

In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop

Note: of that time, the join itself is surprisingly fast.

  • Not including time to read in the string (which both solutions use), I wonder if there's a trick here... – Andy Hayden Nov 18 '13 at 02:18
  • I had to add lines=True as in `data = pd.read_json('/path/to/file.json', lines=True)` – ShahNewazKhan May 23 '18 at 16:13
  • It seems to not work for big json files, if an error happens in the json. – devssh Jun 06 '18 at 10:00
  • @devssh please post an issue on GitHub – Andy Hayden Jun 06 '18 at 15:15
  • So, there is a closed issue for this on GitHub (https://github.com/pandas-dev/pandas/issues/18152). I validated that the multiple JSON objects in the big file are not invalid and loaded them successfully in chunks: `dfs = pd.read_json('file', lines=True, chunksize=x)` and `for df in dfs: df.head()` – devssh Jun 07 '18 at 05:32
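
A minimal sketch of that chunked approach (the file name and chunk size below are just illustrative, not from the original post):

import pandas as pd

# chunksize makes read_json return an iterator of DataFrames,
# so the whole file never has to be parsed into one frame at once
reader = pd.read_json('test.json', lines=True, chunksize=10000)
for chunk in reader:
    # process each chunk here, e.g. filter or aggregate before keeping it
    print(chunk.head())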

If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:

with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)

Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
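
For example, a common drop-in pattern (a sketch; it assumes test.json is the line-delimited file discussed above):

import pandas as pd

try:
    import simplejson as json  # C-accelerated parser, if installed
except ImportError:
    import json  # fall back to the standard library

with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)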

  • Actually I think the first thing the DataFrame constructor does is call list on a generator like this, so both memory and timings will be the same. Performance of simplejson lies somewhere between pandas' read_json and json. – Andy Hayden Mar 04 '14 at 23:39
  • Ah, that's too bad; seems you are right about the `DataFrame` constructor. And recent versions of Python include compiled C extensions for the builtin `json`. Fortunately as of Pandas 0.19, you can use `read_json(lines=True)`. – Doctor J Sep 18 '16 at 23:48
  • @AndyHayden: This would still save memory over the OP's `l=[ json.loads(l) for l in test.splitlines()]`, which needs to have, in memory, all at once: 1. The original complete file data, 2. The file data split into lines (deleted once all lines parsed), and 3. The parsed JSON objects. Reading lazily and loading line by line means you only have #3 from the above, plus one (technically two, but logically one) line of the file in memory at once. Sure, all the parsed objects are in memory, but not two extra copies of the file data to boot. – ShadowRanger Nov 29 '17 at 17:00
  • @ShadowRanger no, the first thing the DataFrame constructor does is apply `list` to the iterator. It's completely equivalent. – Andy Hayden Nov 29 '17 at 18:45
  • @AndyHayden: It would be equivalent if the iterable being iterated were equivalent, but the OP's iterable is `test.splitlines()` (meaning `test` and the `list` of resulting lines are all held in memory while the `list` is built), while Doctor J's iterable is `f` (an open file object), which pulls each line as it goes, replacing it immediately after each `json.loads`. `pd.DataFrame(json.loads(line) for line in f)` and `pd.DataFrame([json.loads(line) for line in f])` *would* be equivalent (former `list`ified by `DataFrame`, latter makes `list` directly), but file vs. `str.split` differs. – ShadowRanger Nov 29 '17 at 18:52
  • Point is, Doctor J's solution is an improvement because it doesn't slurp the file into memory up front; it processes it as it reads, so peak memory usage is just the `list` of parsed JSON objects (plus the size of the two largest lines in the file), not the `list` of JSON objects, plus the complete file contents, plus the complete file contents again (split into lines). – ShadowRanger Nov 29 '17 at 18:55
  • @ShadowRanger except it's creating a python dict for each line, which is always going to be much larger in memory than the string (Hence from the timing results for 1000 rows in my answer) – Andy Hayden Nov 29 '17 at 19:02
  • @AndyHayden: True. I never claimed it wasn't going to have that, just that it wasn't going to compound that problem by storing multiple copies of the complete file bytes in memory. You could reduce the overhead of the `list` of `dict`s though, combining this approach w/`collections.namedtuple`. `Row = namedtuple('Row', ('a', 'b'))`, then replace the definition of `data` with `data = pd.DataFrame(Row(**json.loads(line)) for line in f)`. Now, instead of #lines `dict`s, you have #lines `Row`s, which run (on my system) 64 bytes of overhead each, rather than 240, roughly 1/4 the memory overhead. – ShadowRanger Nov 29 '17 at 19:22
  • Mind you, the `namedtuple` approach does require that all the JSON objects have the same set of attributes (or that complex approaches that provide defaults for optional values be used), so it's not suited to all circumstances, but the temporaries are a lot less heavyweight this way. – ShadowRanger Nov 29 '17 at 19:25
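
A sketch of the namedtuple variant described in the comments above (it assumes every line has exactly the keys a and b, as in the question's example):

from collections import namedtuple
import json
import pandas as pd

# one lightweight namedtuple per line instead of one dict per line;
# pandas uses the field names as the column names
Row = namedtuple('Row', ('a', 'b'))

with open('test.json') as f:
    data = pd.DataFrame(Row(**json.loads(line)) for line in f)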

As of Pandas 0.19, read_json has native support for line-delimited JSON:

pd.read_json(jsonfile, lines=True)

++++++++Update++++++++++++++

As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:

df=pd.read_json('test.json', lines=True)

++++++++Old Answer++++++++++

The existing answers are good, but for a little variety, here is another way to accomplish your goal that requires a simple pre-processing step outside of Python so that pd.read_json() can consume the data.

  • Install jq https://stedolan.github.io/jq/.
  • Create a valid json file with cat test.json | jq -c --slurp . > valid_test.json
  • Create dataframe with df=pd.read_json('valid_test.json')

In an IPython notebook, you can run the shell command directly from the cell interface with

!cat test.json | jq -c --slurp . > valid_test.json
df=pd.read_json('valid_test.json')