
Simple question: all tutorials I've read show you how to output the result of a parallel computation to a list (or at best a dictionary) using either ipython.parallel or multiprocessing.

Could you point me to a simple example of outputting the result of a computation to a shared pandas dataframe using either library?

http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html - this tutorial shows you how to read the input dataframe in chunks (code below), but how would I then output the results of the 4 parallel computations to ONE dataframe, please?

import pandas as pd
import multiprocessing as mp

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000  # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds

    pool.close()
    pool.join()

    print("There are %d rows of data" % result)
newmathwhodis
  • Why don't you place your output into one list and reduce it to one data frame like `reduce(lambda x,y: x.append(y), your_list)` ? – crs May 21 '15 at 19:54
  • You need to show us what you're trying to do, show us your single-threaded solution and how you intend to multiprocess it. – sirfz May 21 '15 at 20:07
  • possible duplicate of [Parallelize apply after pandas groupby](http://stackoverflow.com/questions/26187759/parallelize-apply-after-pandas-groupby) – Mike McKerns May 22 '15 at 11:29

1 Answer


You are asking multiprocessing (or other python parallel modules) to output to a data structure that they don't directly output to. If you use a Pool from any of the parallel packages, the best you are going to get is a list (using map) or an iterator (using imap). If you use shared memory from multiprocessing, you might be able to get the result into a memory block that can be accessed via a pointer through ctypes.
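For the list route, here's a minimal sketch of what that could look like (borrowing the names from your code above, but assuming process_frame is changed to return a DataFrame rather than a count):

import pandas as pd
import multiprocessing as mp

def process_frame(df):
    # placeholder per-chunk computation that returns a DataFrame
    return df.describe()

if __name__ == '__main__':
    reader = pd.read_table("D:\\my_large_file.txt", chunksize=100000)
    pool = mp.Pool(4)
    # map gives back a plain list of per-chunk DataFrames...
    pieces = pool.map(process_frame, reader)
    pool.close()
    pool.join()
    # ...which pandas can combine into ONE DataFrame afterwards
    combined = pd.concat(pieces)

Note that the combining happens back in the parent process; the workers never touch a shared DataFrame.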

So the question then would be: can you pull results from an iterator or from a block of shared memory into a pandas.DataFrame? I think the answer is yes. However, I don't think I've seen a simple example of doing so in a tutorial… as it's not that simple to do.

The iterator route seems much less likely, as you'd need to get numpy to digest an iterator without first pulling the results back into python as a list. I'd go with the shared memory route. I think this should give you a DataFrame over a shared array, which you can then use with multiprocessing:

from multiprocessing import sharedctypes as sh
from numpy import ctypeslib as ct
import pandas as pd

ra = sh.RawArray('i', 4)  # 4 C ints in shared memory
arr = ct.as_array(ra)     # numpy view onto the shared block (no copy)
arr.shape = (2, 2)
x = pd.DataFrame(arr)     # DataFrame over the array
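A quick sanity check that the DataFrame really sits on top of the shared array (depending on your pandas version it may copy on construction; if so, just rebuild the DataFrame after the workers finish):

arr[0, 0] = 42
print(x.iloc[0, 0])  # prints 42 if the DataFrame shares the buffer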

Then all you'd have to do is hand each multiprocessing.Process a slice of the array:

import multiprocessing as mp

# give each worker its own row of the shared array
p1 = mp.Process(target=doit, args=(arr[:1, :], 1))
p2 = mp.Process(target=doit, args=(arr[1:, :], 2))
p1.start()
p2.start()
p1.join()
p2.join()

Then, with some pointer magic, the results should show up in your DataFrame.

I'll leave you to write the doit function to manipulate the array as you like.
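For example, a minimal doit could just fill its slice in place (this sketch assumes a fork start method, the default on Linux, so the child's numpy view still points at the shared RawArray; under spawn the slice would be pickled and copied):

def doit(a, value):
    # `a` is a numpy view over one row of the shared RawArray;
    # with a fork start method the child inherits the mapping,
    # so this in-place write is visible to the parent
    a[:] = value

After both joins, x should show row 0 filled with 1s and row 1 filled with 2s.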

EDIT: this looks like a good answer using a similar approach: https://stackoverflow.com/a/22487898/2379433. This also seems to work: https://stackoverflow.com/a/27027632/2379433.

Mike McKerns