
I have a huge list of data frames called df_list (with some different and some common columns) which I wish to merge into one big data frame. I have tried the following:

    import pandas as pd

    all_dfs = pd.concat(df_list)

However, this takes too much time on a single core; I killed the script after 48 hours. How would you parallelize this process to use all my cores, or rewrite the code to make it faster?

MathStud

1 Answer


pandas is not designed for parallel processing.

The easiest way is to use third-party tools to process huge data frames; they can run the computation across multiple cores or even different nodes.

  • You can look at dask (it offers a pandas-like interface); see the sketch after this list.

  • You can look at pyspark.
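
A minimal sketch of the dask route. The `df_list` here is a tiny stand-in for the question's list, and I'm assuming dask's `concat` defaults to an outer join on columns, as pandas does:

    # Sketch: concatenate many pandas DataFrames via dask so the scheduler
    # can spread the work over all local cores.
    import dask.dataframe as dd
    import pandas as pd

    # Tiny stand-in for the question's df_list (some common, some different columns).
    df_list = [
        pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
        pd.DataFrame({"b": [5], "c": [6]}),
    ]

    # Wrap each frame, concatenate lazily, then compute the result.
    ddfs = [dd.from_pandas(df, npartitions=1) for df in df_list]
    all_dfs = dd.concat(ddfs).compute()
    print(all_dfs)

Nothing runs until `.compute()` is called, so dask can plan the whole concatenation before executing it in parallel.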

You can also use swifter to run processing on multiple cores; a small sketch follows.
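
Note that, as far as I know, swifter parallelizes `.apply()` calls rather than `pd.concat` itself, so it helps with per-row processing steps around the merge. A hedged sketch:

    # Sketch: swifter chooses a vectorized, multi-core, or dask-backed
    # execution path for .apply() automatically.
    import pandas as pd
    import swifter  # noqa: F401 -- importing registers the .swifter accessor

    df = pd.DataFrame({"x": range(1_000_000)})

    # Same semantics as df["x"].apply(...), but potentially parallel.
    df["y"] = df["x"].swifter.apply(lambda v: v * 2)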

There are probably other tools as well. In other words, in your case it is better to run the calculations across many cores or a cluster.
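
For the cluster route, a hedged pyspark sketch (assuming a local Spark installation; `unionByName(allowMissingColumns=True)` needs Spark >= 3.1, and `df_list` is again a tiny stand-in):

    # Sketch: union many frames by column name, filling columns that are
    # missing from one side with nulls.
    from functools import reduce

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    df_list = [
        pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
        pd.DataFrame({"b": [5], "c": [6]}),
    ]

    sdfs = [spark.createDataFrame(df) for df in df_list]
    big = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), sdfs)

    all_dfs = big.toPandas()  # bring the result back as one pandas DataFrame
    print(all_dfs)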

Hope this helps.

Danila Ganchar
  • [good example pandas + dask](https://stackoverflow.com/questions/45545110/how-do-you-parallelize-apply-on-pandas-dataframes-making-use-of-all-cores-on-o/45545111#45545111) – Danila Ganchar Dec 30 '19 at 13:53