
I am running an experiment whose result is high-dimensional and structured. I use a MultiIndex to represent the result object and multiprocessing to compute and fill it. The result set is quite large, easily reaching millions to billions of entries. If the result is 3D, I can let the function that does the computation return a DataFrame from each process and then combine them into a panel afterwards.
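For the 3D case, the return-and-combine pattern might look like this minimal sketch (the worker `compute_slice` and the parameter grid are made-up placeholders, not my actual computation):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

def compute_slice(param):
    """Hypothetical worker: return one 2D slice of the result as a DataFrame."""
    idx = pd.MultiIndex.from_product([[param], range(3), range(4)],
                                     names=["param", "i", "j"])
    return pd.DataFrame({"value": np.random.rand(len(idx))}, index=idx)

if __name__ == "__main__":
    params = range(10)                            # hypothetical parameter grid
    with mp.Pool() as pool:
        slices = pool.map(compute_slice, params)  # each process returns its own DataFrame
    result = pd.concat(slices).sort_index()       # one MultiIndexed frame for the whole result
    print(result.head())
```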

When the result object is 5D or higher, I find it neither straightforward nor memory-efficient to return the subset of the result from each function run in a single process. However, it does not work if I let each process write its result directly to the global MultiIndexed DataFrame (the result) that was created before the computation: the values of the result df remain all NaN, exactly as when it was created.
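To make the failure concrete, here is a minimal repro of the write-to-global pattern (the worker `fill` and the toy index are hypothetical); assuming a fork-based start method, each child mutates its own copy of the frame, so the parent's frame stays NaN:

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

# Global result frame created before the computation, all NaN.
index = pd.MultiIndex.from_product([range(4), range(3)], names=["a", "b"])
result = pd.DataFrame(np.nan, index=index, columns=["value"])

def fill(a):
    """Hypothetical worker: writes into what it thinks is the shared frame."""
    result.loc[(a, 0), "value"] = float(a)   # only mutates this child's copy

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        pool.map(fill, range(4))
    print(result["value"].isna().all())      # True -- the parent's frame is untouched
```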

Any suggestions are greatly appreciated!

  • What kind of multiprocessing are you using? I would recommend taking a look at dask DataFrames or, if that's not enough, at Apache Spark SQL – MaxU - stand with Ukraine Jun 08 '16 at 18:54
  • 1
  • I don't know MultiIndex. Only shared memory (not ordinary objects) allows shared reads and writes. A subprocess created by a Linux fork will copy a globally shared object as soon as the subprocess modifies it, so each subprocess ends up modifying its own local copy, which is not shared with the other processes. I guess that is why you see NaN (see the sketch below). See this: http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing and also this: http://stackoverflow.com/questions/659865/python-multiprocessing-sharing-a-large-read-only-object-between-processes – rxu Jun 09 '16 at 17:07
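Along the lines of the first link in that comment, a minimal sketch of the shared-memory route, assuming a fork-based start method (the Linux default), a flat float buffer backing the whole result, and a toy 5D grid; the buffer is only wrapped in a MultiIndexed DataFrame in the parent at the end (all names here are made up):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(2), range(3), range(4), range(5), range(6)], names=list("abcde"))
shared = mp.RawArray("d", len(index))      # flat double buffer backing the 5D result

def init(buf):
    """Give every worker a numpy view onto the shared buffer."""
    global view
    view = np.frombuffer(buf)

def fill(a):
    """Hypothetical worker: fill the block of rows belonging to level-0 label `a`."""
    block = len(index) // 2                # rows per level-0 label in this toy grid
    view[a * block:(a + 1) * block] = float(a)

if __name__ == "__main__":
    with mp.Pool(2, initializer=init, initargs=(shared,)) as pool:
        pool.map(fill, range(2))
    # Wrap the filled buffer in a MultiIndexed DataFrame only in the parent.
    result = pd.DataFrame({"value": np.frombuffer(shared)}, index=index)
    print(result.groupby(level="a").mean())
```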

0 Answers