18

I am running a script (in multiprocessing mode) that extracts some parameters from a bunch of JSON files, but it is currently very slow. Here is the script:

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool
import subprocess
try:
    import simplejson as json
except ImportError:
    import json


path = '/data/data//*.A.1'
print("Running with PID: %d" % getpid())

def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open('/data/data/A.1/%s_DI' % filename, 'w') as w:
        with open(file, 'r') as f:
            for n, line in enumerate(f):
                d = json.loads(line)
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print (d)
                    pass

if __name__ == "__main__":
    files_list = glob(path)
    cores = 12
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()

Does anybody know how to speed this up?

UserYmY
  • Half serious answer: rewrite it in a faster language :-) – Kevin Dec 10 '14 at 17:42
  • @kevin am a beginner in programming and just started with python. don't know that much of other languages. Do you have any suggestion for a faster language? – UserYmY Dec 10 '14 at 17:43
  • maybe your strategy from the beginning is wrong: a serious DB is meant to solve your problem. – Jason Hu Dec 10 '14 at 17:45
  • @HuStmpHrrr do you mean an external db?can you elaborate? – UserYmY Dec 10 '14 at 17:47
  • Have you profiled this to ensure that the json.loads() is actually the thing taking all the time? – kdopen Dec 10 '14 at 17:49
  • @kdopen that is what I guess cuz the rest is just opening and writing into the file – UserYmY Dec 10 '14 at 17:50
  • But you have multiple threads and a lot of I/O. On modern machines, you will largely be I/O bound – kdopen Dec 10 '14 at 17:56
  • @kdopen am running it on a server with multiple cores. am guessing IO would not be the problem here – UserYmY Dec 10 '14 at 18:02
  • How slow is slow? You could get the size of all files and divide by runtime to get bandwidth. You have a lot of print statements - is the program dumping a bunch of stuff to the screen? That would be your problem. "I'm guessing IO would not be the problem here" - unless you have a lot of other processing or are using an SSD, this is a disk bound operation. – tdelaney Dec 10 '14 at 18:17
  • @tdelaney it is 0.04 G per minute. am not writing on ssd and am not getting anything printed on the screen yet rather than file names (11 lines) – UserYmY Dec 10 '14 at 18:26
  • I did some experiments with your code, assuming you have lots of small json objects per line and found out to my surprise that the json parsing was very much cpu bound and I could only get about 28 MB/s on my desktop machine. simplejson was faster than json or ultrajson. – tdelaney Dec 10 '14 at 19:30
  • By comparison, you could replace the jsonizing for loop with one that just writes the input line to the output file to get a feel for the i/o bandwidth performance. – tdelaney Dec 10 '14 at 19:34
  • @tdelaney thank you for checking it. I do not understand your last comment completely, can you post it as answer in accordance to my code? – UserYmY Dec 10 '14 at 19:40

4 Answers

20

Switch from

import json 

to

import ujson as json

https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/

or switch to orjson

import orjson as json

https://github.com/ijl/orjson
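
Since the rest of the script calls json.loads, the faster library has to be imported under the json name. A minimal fallback sketch, assuming none of these packages is guaranteed to be installed:

# Prefer the fastest JSON library available, falling back to the stdlib.
# Sketch only: the script just needs json.loads(line) to exist.
try:
    import orjson as json  # Rust-based, typically the fastest
except ImportError:
    try:
        import ujson as json  # UltraJSON, written in C
    except ImportError:
        import json  # standard library fallback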

Ryabchenko Alexander
    I know this is over two years old, but I just had to comment to say I tried this and its LIGHTNING FAST. Thank you! – Bajan Apr 21 '21 at 16:21
13

First, find out where your bottlenecks are.

If it is on the json decoding/encoding step, try switching to ultrajson:

UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.

The changes would be as simple as changing the import part:

try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json

I've also done a simple benchmark at What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?, take a look.
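
If you want a quick sanity check before switching libraries, you could time the decoding step alone on one of your input files. A rough sketch; the sample path below is just a placeholder for one of your *.A.1 files:

# Rough sketch: time JSON decoding alone to see whether it dominates.
# 'sample_path' is a placeholder for one of the input files.
from time import time

try:
    import ujson as json
except ImportError:
    import json

sample_path = '/data/data/sample.A.1'  # placeholder

with open(sample_path, 'r') as f:
    lines = f.readlines()

start = time()
for line in lines:
    json.loads(line)
print('decoded %d lines in %.2f seconds' % (len(lines), time() - start))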

alecxe
0

I updated the script a bit to try different experiments and found that yes, JSON parsing is CPU bound. I got 28 MB/s, which is better than your 0.04 GB per minute (roughly 0.7 MB/s), so I'm not sure what's going on there. When skipping the JSON parsing and just writing each input line to the output file, I got 996 MB/s.

In the code below, you can generate a dataset with python slow.py create and test several scenarios by changing the code marked todo:. My dataset was only 800 MB, so I/O was absorbed by the RAM cache (run it twice to make sure that the files to read have been cached).

I was surprised that JSON decoding is so CPU intensive.

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool, cpu_count
import subprocess

# todo: pick your poison
#import json
#import ujson as json
import simplejson as json

import sys

# todo: choose your data path
#path = '/data/data//*.A.1'
#path = '/tmp/mytest'
path = os.path.expanduser('~/tmp/mytest')

# todo: choose your cores
#cores = 12
cores = cpu_count()

print("Running with PID: %d" % getpid())

def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open(file + '.out', 'w', buffering=1024*1024) as w:
        with open(file, 'r', buffering=1024*1024) as f:
            for n, line in enumerate(f):

                # todo: for pure bandwidth calculations
                #w.write(line)
                #continue

                try:
                    d = json.loads(line)
                except Exception as e:
                    raise RuntimeError("'%s' in %s: %s" % (str(e), file, line))
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print (d, 'error')
                    pass
    return os.stat(file).st_size

def create_files(path, files, entries):
    print('creating files')
    extra = [i for i in range(32)]
    if not os.path.exists(path):
        os.makedirs(path)
    for i in range(files):
        fn = os.path.join(path, 'in%d.json' % i)
        print(fn)
        with open(fn, 'w') as fp:
            for j in range(entries):
                json.dump({'rrname':'fred', 
                     'rdata':[str(k) for k in range(10)],
                     'extra':extra},fp)
                fp.write('\n')


if __name__ == "__main__":
    if 'create' in sys.argv:
        create_files(path, 1000, 100000)
        sys.exit(0)
    files_list = glob(os.path.join(path, '*.json'))
    print('processing', len(files_list), 'files in', path)
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    total = 0
    start = time()
    for result in pp.imap_unordered(process_file, files_list):
        total += result
    pp.close()
    pp.join()
    delta = time() - start
    mb = total/1000000
    print('%d MB total, %d MB/s' % (mb, mb/delta))
tdelaney
  • I have a similar optimixation issue. ujson & simplejson are not working for me. Can you have a look if your solution above can be applied. Link: https://stackoverflow.com/q/62905750/12968007 – Abhi Jul 15 '20 at 05:33
0

For installation:

pip install orjson 

For import:

import orjson as json

This works especially well if you need to dump or load large arrays.
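
One caveat, based on orjson's documented behaviour: orjson.dumps returns bytes rather than str, so code that writes the result to a text-mode file has to decode it first. A minimal sketch:

import orjson as json

record = {'rrname': 'example.com', 'rdata': ['1.2.3.4']}

data = json.dumps(record)    # bytes, not str
line = data.decode('utf-8')  # decode before writing to a text-mode file

parsed = json.loads(line)    # loads accepts both str and bytes
print(parsed['rrname'])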

F_Schmidt