
I am using (Python's) pandas map function to process a big CSV file (~50 gigabytes), like this:

import pandas as pd

df = pd.read_csv("huge_file.csv")
df["results1"], df["results2"] = df.map(foo)
df.to_csv("output.csv")

Is there a way I can use parallelization on this? Perhaps using multiprocessing's map function?

Thanks, Jose

Jose G

1 Answer


See the pandas I/O docs on iterating through a CSV chunk by chunk (the chunksize argument to read_csv) and on appending to an HDFStore table.

You are much better off reading your CSV in chunks, processing each chunk, then writing the results out to a CSV (of course, you are even better off converting to HDF5); a minimal sketch follows the list below.

  • It takes a relatively constant amount of memory
  • It is efficient and can be done in parallel (this usually requires an HDF5 file you can select sections from, though; a CSV is not good for this)
  • It is less complicated than trying to do multiprocessing directly
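
Here is a minimal sketch of the chunked approach, assuming foo and the "some_column" column from the question (both placeholders) and an arbitrary chunk size:

import pandas as pd

first = True
# Read the CSV a chunk at a time; each chunk is an ordinary DataFrame
for chunk in pd.read_csv("huge_file.csv", chunksize=1_000_000):
    # "some_column" and foo are placeholders from the question
    chunk["results1"], chunk["results2"] = zip(*chunk["some_column"].map(foo))
    # Write the first chunk with a header, then append the rest
    chunk.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
    first = False

For the HDF route, each processed chunk can instead be appended to a table with HDFStore.append, which later makes it cheap to select row ranges from the file in parallel.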
Jeff
  • Note that (much like sharding in a Mongo database) chunk-level parallelism doesn't work well if you need overlapping data (like a rolling time series regression) in the operations to be mapped. In those cases, it's much faster to form the pandas groups first and manually dispatch them to different resources for computation, for example scattering each group to an engine in IPython.parallel. – ely May 08 '14 at 16:28
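
A minimal sketch of that group-then-dispatch pattern, using a multiprocessing.Pool in place of IPython.parallel engines (group_key, some_column, and foo are placeholders; foo must be importable by the worker processes, and the data is assumed to fit in memory once loaded, as in the comment's scenario):

import pandas as pd
from multiprocessing import Pool

def process_group(item):
    # Worker: receives one (key, DataFrame) pair and processes the whole group,
    # so operations that need overlapping data within a group still work
    key, group_df = item
    return key, list(group_df["some_column"].map(foo))

if __name__ == "__main__":
    df = pd.read_csv("huge_file.csv")
    groups = list(df.groupby("group_key"))        # form the pandas groups first
    with Pool() as pool:                          # then dispatch them to workers
        results = pool.map(process_group, groups)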