18

I am running a script (in multiprocessing mode) that extracts some parameters from a bunch of JSON files, but it is currently very slow. Here is the script:

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool
import subprocess
try:
    import simplejson as json
except ImportError:
    import json


path = '/data/data//*.A.1'
print("Running with PID: %d" % getpid())

def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open('/data/data/A.1/%s_DI' % filename, 'w') as w:
        with open(file, 'r') as f:
            for n, line in enumerate(f):
                d = json.loads(line)
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print (d)
                    pass

if __name__ == "__main__":
    files_list = glob(path)
    cores = 12
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()

Does anybody know how to speed this up?

UserYmY
  • Half serious answer: rewrite it in a faster language :-) – Kevin Dec 10 '14 at 17:42
  • @kevin am a beginner in programming and just started with python. don't know that much of other languages. Do you have any suggestion for a faster language? – UserYmY Dec 10 '14 at 17:43
  • maybe your strategy from the beginning is wrong: a serious DB is meant to solve your problem. – Jason Hu Dec 10 '14 at 17:45
  • @HuStmpHrrr do you mean an external db?can you elaborate? – UserYmY Dec 10 '14 at 17:47
  • Have you profiled this to ensure that the json.loads() is actually the thing taking all the time? – kdopen Dec 10 '14 at 17:49
  • @kdopen that is what I guess cuz the rest is just opening and writing into the file – UserYmY Dec 10 '14 at 17:50
  • But you have multiple threads and a lot of I/O. On modern machines, you will largely be I/O bound – kdopen Dec 10 '14 at 17:56
  • @kdopen am running it on a server with multiple cores. am guessing IO would not be the problem here – UserYmY Dec 10 '14 at 18:02
  • How slow is slow? You could get the size of all files and divide by runtime to get bandwidth. You have a lot of print statements - is the program dumping a bunch of stuff to the screen? That would be your problem. "I'm guessing IO would not be the problem here" - unless you have a lot of other processing or are using an SSD, this is a disk bound operation. – tdelaney Dec 10 '14 at 18:17
  • @tdelaney it is 0.04 G per minute. am not writing on ssd and am not getting anything printed on the screen yet rather than file names (11 lines) – UserYmY Dec 10 '14 at 18:26
  • I did some experiments with your code, assuming you have lots of small json objects per line and found out to my surprise that the json parsing was very much cpu bound and I could only get about 28 MB/s on my desktop machine. simplejson was faster than json or ultrajson. – tdelaney Dec 10 '14 at 19:30
  • By comparison, you could replace the jsonizing for loop with one that just writes the input line to the output file to get a feel for the i/o bandwidth performance. – tdelaney Dec 10 '14 at 19:34
  • @tdelaney thank you for checking it. I do not understand your last comment completely, can you post it as answer in accordance to my code? – UserYmY Dec 10 '14 at 19:40

4 Answers

20

Switch from

import json 

to

import ujson as json

https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/

or switch to orjson

import orjson as json

https://github.com/ijl/orjson
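
Since the rest of the script calls json.loads, the faster library has to be imported under the json name. A minimal fallback sketch, assuming none of these packages is guaranteed to be installed:

# Prefer the fastest JSON library available, falling back to the stdlib.
# Sketch only: the script just needs json.loads(line) to exist.
try:
    import orjson as json  # Rust-based, typically the fastest
except ImportError:
    try:
        import ujson as json  # UltraJSON, written in C
    except ImportError:
        import json  # standard library fallback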

Ryabchenko Alexander
    I know this is over two years old, but I just had to comment to say I tried this and its LIGHTNING FAST. Thank you! – Bajan Apr 21 '21 at 16:21
13

First, find out where your bottlenecks are.

If it is on the json decoding/encoding step, try switching to ultrajson:

UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.

The changes would be as simple as changing the import part:

try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json

I've also done a simple benchmark at What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?, take a look.
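
If you want a quick sanity check before switching libraries, you could time the decoding step alone on one of your input files. A rough sketch; the sample path below is just a placeholder for one of your *.A.1 files:

# Rough sketch: time JSON decoding alone to see whether it dominates.
# 'sample_path' is a placeholder for one of the input files.
from time import time

try:
    import ujson as json
except ImportError:
    import json

sample_path = '/data/data/sample.A.1'  # placeholder

with open(sample_path, 'r') as f:
    lines = f.readlines()

start = time()
for line in lines:
    json.loads(line)
print('decoded %d lines in %.2f seconds' % (len(lines), time() - start))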

alecxe
0

I updated the script a bit to try different experiments and found that yes, JSON parsing is CPU bound. I got 28 MB/s, which is better than your 0.04 GB per minute (roughly 0.7 MB/s), so I'm not sure what's going on there. When skipping the JSON parsing and just writing each input line to the output file, I got 996 MB/s.

In the code below, you can generate a dataset with python slow.py create and test several scenarios by changing the code marked todo:. My dataset was only 800 MB, so I/O was absorbed by the RAM cache (run it twice to make sure that the files to read have been cached).

I was surprised that JSON decoding is so CPU intensive.

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool, cpu_count
import subprocess

# todo: pick your poison
#import json
#import ujson as json
import simplejson as json

import sys

# todo: choose your data path
#path = '/data/data//*.A.1'
#path = '/tmp/mytest'
path = os.path.expanduser('~/tmp/mytest')

# todo: choose your cores
#cores = 12
cores = cpu_count()

print("Running with PID: %d" % getpid())

def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open(file + '.out', 'w', buffering=1024*1024) as w:
        with open(file, 'r', buffering=1024*1024) as f:
            for n, line in enumerate(f):

                # todo: for pure bandwidth calculations
                #w.write(line)
                #continue

                try:
                    d = json.loads(line)
                except Exception as e:
                    raise RuntimeError("'%s' in %s: %s" % (str(e), file, line))
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print (d, 'error')
                    pass
    return os.stat(file).st_size

def create_files(path, files, entries):
    print('creating files')
    extra = [i for i in range(32)]
    if not os.path.exists(path):
        os.makedirs(path)
    for i in range(files):
        fn = os.path.join(path, 'in%d.json' % i)
        print(fn)
        with open(fn, 'w') as fp:
            for j in range(entries):
                json.dump({'rrname':'fred', 
                     'rdata':[str(k) for k in range(10)],
                     'extra':extra},fp)
                fp.write('\n')


if __name__ == "__main__":
    if 'create' in sys.argv:
        create_files(path, 1000, 100000)
        sys.exit(0)
    files_list = glob(os.path.join(path, '*.json'))
    print('processing', len(files_list), 'files in', path)
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    total = 0
    start = time()
    for result in pp.imap_unordered(process_file, files_list):
        total += result
    pp.close()
    pp.join()
    delta = time() - start
    mb = total/1000000
    print('%d MB total, %d MB/s' % (mb, mb/delta))
tdelaney
  • I have a similar optimixation issue. ujson & simplejson are not working for me. Can you have a look if your solution above can be applied. Link: https://stackoverflow.com/q/62905750/12968007 – Abhi Jul 15 '20 at 05:33
0

For installation:

pip install orjson 

For import:

import orjson as json

This works especially well if you need to dump or load large arrays.
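
One caveat, based on orjson's documented behaviour: orjson.dumps returns bytes rather than str, so code that writes the result to a text-mode file has to decode it first. A minimal sketch:

import orjson as json

record = {'rrname': 'example.com', 'rdata': ['1.2.3.4']}

data = json.dumps(record)    # bytes, not str
line = data.decode('utf-8')  # decode before writing to a text-mode file

parsed = json.loads(line)    # loads accepts both str and bytes
print(parsed['rrname'])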

F_Schmidt