So let's say you have a Python process that collects data in real time at around 500 rows per second (this can be further parallelized to reduce it to around 50 rows per second) from a queueing system and appends it to a DataFrame:
import pandas as pd

rq = MyRedisQueue(..)
df = pd.DataFrame()

while 1:
    recv = rq.get(block=True)
    # some converting
    df = df.append(recv, ignore_index=True)  # append returns a new frame, it does not modify df in place
Now the question is: how can I utilize the CPUs on this data? I am fully aware of the limitations of the GIL, and I looked into the multiprocessing Manager namespace, here too, but it looks like there are some drawbacks with regard to latency on the centrally held DataFrame. Before digging into that further, I also tried pool.map, which I then recognized applies pickle between the processes, which is way too slow and has too much overhead.
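For reference, this is roughly the pattern I looked at with the Manager namespace (simplified, not my real code): every read or write of ns.df goes through the manager process and gets pickled, which seems to be where the latency comes from.

import multiprocessing as mp
import pandas as pd

def worker(ns, lock):
    # every access to ns.df pulls a pickled copy from the manager process,
    # so this is a snapshot, not a live view of the parent's DataFrame
    with lock:
        snapshot = ns.df
    print(snapshot.describe())

if __name__ == "__main__":
    mgr = mp.Manager()
    ns = mgr.Namespace()
    ns.df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    lock = mgr.Lock()
    p = mp.Process(target=worker, args=(ns, lock))
    p.start()
    p.join()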
So after all of this, I finally wonder: how (if at all) can an insert of 500 rows per second (or even 50 rows per second) be transferred to different processes, with some CPU time left over for applying statistics and heuristics to the data in the child processes?
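To make it concrete, something like the following is what I have in mind (only a sketch; compute_stats() is a placeholder for my statistics/heuristics): batches of rows get fanned out over a multiprocessing.Queue, so only the new rows are pickled rather than the whole frame, and each worker keeps its own partial DataFrame.

import multiprocessing as mp
import pandas as pd

def worker(q):
    local_df = pd.DataFrame()
    while True:
        batch = q.get()            # blocks until the producer sends a batch
        if batch is None:          # sentinel: shut the worker down
            break
        local_df = pd.concat([local_df, pd.DataFrame(batch)], ignore_index=True)
        # compute_stats(local_df)  # placeholder for the actual analysis

if __name__ == "__main__":
    q = mp.Queue(maxsize=1000)
    workers = [mp.Process(target=worker, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()
    # the producer side would do q.put(batch) for every batch pulled from Redis,
    # and q.put(None) once per worker to stop them
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()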
Maybe it would be better to implement a custom TCP socket or queueing system between the two processes? Or are there implementations in pandas or other libraries that really allow fast access to the one big DataFrame in the parent process? I love pandas!
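Or is a shared-memory view (Python 3.8+ multiprocessing.shared_memory) the kind of thing I should be looking at? A rough sketch of what I mean, assuming a fixed-size, all-float frame (publish() and attach() are my own placeholder names, and this does not handle the frame growing while children read it):

from multiprocessing import shared_memory
import numpy as np
import pandas as pd

def publish(df):
    # copy the numeric values once into a shared-memory block
    values = np.ascontiguousarray(df.to_numpy(dtype="float64"))
    shm = shared_memory.SharedMemory(create=True, size=values.nbytes)
    np.ndarray(values.shape, dtype=values.dtype, buffer=shm.buf)[:] = values
    return shm, values.shape

def attach(name, shape, columns):
    # intended to run in a child process: view over the parent's block, no pickling
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype="float64", buffer=shm.buf)
    return pd.DataFrame(arr, columns=columns), shm

if __name__ == "__main__":
    df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
    shm, shape = publish(df)
    # done in the same process here only to keep the demo short
    view, child_shm = attach(shm.name, shape, list(df.columns))
    print(view)
    del view              # drop all references to the buffer before closing it
    child_shm.close()
    shm.close()
    shm.unlink()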