
Is there a way to speed up converting a stream of strings to JSON/dictionaries using pypy3? I know ujson on Python 3 can be faster than Python 3's built-in json, but it's not really faster than pypy3's json.loads().

More info on what I have: my program reads a stream of JSON strings from a subprocess and converts (loads) them with json.loads(). If I comment out the json.loads() line (so it basically just reads stdout from the subprocess), it still takes about 60% of the total execution time.

So I was thinking a pool of processes or threads could improve this (hopefully cutting total execution time, even if only to ~80% of what it is now) by performing the conversion in parallel. Unfortunately, it did not help at all. Multithreading gave the same results, and multiprocessing took longer than a single process (probably mostly due to the overhead and serialization). Is there any other change I could make to improve performance with pypy3?

For reference, here's a quick example of code (just reading from some file instead):

import json
import timeit
from multiprocessing.pool import ThreadPool
from multiprocessing import Pool


def get_stdout():
    with open("input.txt", "r") as f:
        for line in f:
            yield line


def convert(line):
    d = json.loads(line)
    return d


def multi_thread():
    mt_pool = ThreadPool(3)
    for d in mt_pool.imap(convert, get_stdout()):
        pass


def multi_process():
    with Pool(3) as mp_pool:
        for d in mp_pool.imap(convert, get_stdout()):
            pass


def regular():
    for line in get_stdout():
        d = convert(line)


print("regular: ", timeit.repeat("regular()", setup="from __main__ import regular", number=1, repeat=5))
print("multi_thread: ", timeit.repeat("multi_thread()", setup="from __main__ import multi_thread", number=1, repeat=5))
print("multi_process: ", timeit.repeat("multi_process()", setup="from __main__ import multi_process", number=1, repeat=5))

Output:

regular: [5.191860154001915, 5.045155504994909, 4.980729935996351, 5.253822096994554, 5.9532385260026786]
multi_thread: [5.08890142099699, 5.088432839998859, 5.156651658995543, 5.781010364997201, 5.082046301999071]
multi_process: [26.595598744999734, 30.841693959999247, 29.383782051001617, 27.83700947300531, 21.377069750000373]
user1179317
  • What do you really want? Do you really want to load the full document? If so, then the speed of every library will be bounded by the speed of creating the Python objects (which is slow unless there are only a few big objects). A few libraries may be slower than that, but none can be faster (unless the document is lazily evaluated, which means the full document is not loaded). – Jérôme Richard Nov 18 '21 at 09:18
  • My idea was: if it takes say 10s to read a stream of JSON from stdout (just picking a number here), then say another 10s to convert it to Python objects, then with a single process it should take 60s to handle 3 streams of JSON with 3 conversions. With a pool of 2 processes it should technically drop to ~40s, or at least with 3 processes, but I'm definitely not seeing that. Of course more things will be done after the conversion, but that's beside the point. I just want a faster way to get the Python objects so I can perform other calculations. I was hoping there are either better modules or better tricks. – user1179317 Nov 18 '21 at 15:25

1 Answer


The problem with the current code is that sending the Python objects back from the imap workers is very expensive due to inter-process communication; sending the input strings to the worker processes costs something too. The code is not bounded by the input parsing itself but by the creation of the Python objects and their transfer between processes. The threading version does not help because the code is not I/O bound and the Global Interpreter Lock (GIL) prevents any speedup from multiple threads.

The multiprocessing version can be made much faster by not sending the results back: process the JSON documents in the worker processes and send only the minimal amount of information back to the main process (see the sketch below). If the results are numeric, you can use a shared memory area so the workers fill basic NumPy arrays that the main process reads directly. Finally, simdjson can be much faster as long as you do not need to decode the whole document (sketched at the end of this answer).
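To illustrate that suggestion, here is a minimal sketch (not the answer's own code) of parsing in the workers and sending back only a small reduced result instead of the full dictionaries; the "value" key extracted here is hypothetical, and a large chunksize is used to amortize the inter-process communication:

import json
from multiprocessing import Pool


def read_lines():
    with open("input.txt", "r") as f:
        yield from f


def parse_and_reduce(line):
    # Decode in the worker and keep only what the main process needs,
    # so only a tiny object gets pickled and sent back.
    d = json.loads(line)
    return d.get("value")


def multi_process_reduced():
    with Pool(3) as pool:
        # chunksize batches many lines per IPC round-trip.
        for value in pool.imap_unordered(parse_and_reduce, read_lines(), chunksize=1024):
            pass  # use the reduced value here


if __name__ == "__main__":
    multi_process_reduced()

Whether this wins depends on how much the per-document result can be shrunk; if the full dictionaries are needed in the parent process, the transfer cost comes right back.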

PyPy used to have a scalable transactional memory mechanism (pypy-stm) that could probably have been used to implement a faster multithreaded JSON parser. However, AFAIK no module currently uses it, and pypy-stm is no longer developed.

If you really need to turn (big or many) JSON documents into a bunch of heterogeneous Python objects and process all of them quickly (without using multiple independent processes), then Python is probably not the right tool.
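For the simdjson suggestion above, a rough sketch using the pysimdjson binding (module name simdjson; the "value" field is again hypothetical) could look like the following. Only the fields you actually access are turned into Python objects, which is where the saving comes from; note that pysimdjson is a C extension, so under PyPy the cpyext overhead may reduce or cancel the benefit:

import simdjson

parser = simdjson.Parser()

with open("input.txt", "rb") as f:
    for line in f:
        doc = parser.parse(line)   # lazily parsed document proxy
        value = doc["value"]       # only this field is materialized
        # ... use value ...
        del doc                    # drop the proxy before the parser reuses its buffer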

Jérôme Richard