
I am trying to load a data frame from a large CSV file, as shown below. Currently, this line fails with an out-of-memory error. I would like to use the multiprocessing package (`from multiprocessing.pool import ThreadPool`) when loading from this CSV file. Here is the line that currently fails:

source_data_df = pd.read_csv(temp_file, skipinitialspace=True, dtype=str, na_values=['N.A.'])

Could someone show how this line and the supporting code would look when run with multiprocessing?
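
This is the direction I have in mind: an untested sketch that reads the file in chunks and hands each chunk to a ThreadPool worker. The chunk size is an arbitrary guess, and `process_chunk` is a placeholder for whatever per-chunk work applies:

    from multiprocessing.pool import ThreadPool

    import pandas as pd

    temp_file = 'large_input.csv'  # path to the large CSV

    def process_chunk(chunk):
        # Placeholder for per-chunk work (cleaning, filtering, ...).
        # Note: ThreadPool uses threads, so pandas parsing stays GIL-bound;
        # this mainly helps if the per-chunk work releases the GIL.
        return chunk

    # Read the CSV lazily as an iterator of DataFrame chunks instead of
    # loading it all at once; chunksize=100_000 is an arbitrary guess.
    reader = pd.read_csv(
        temp_file,
        skipinitialspace=True,
        dtype=str,
        na_values=['N.A.'],
        chunksize=100_000,
    )

    with ThreadPool(4) as pool:
        chunks = pool.map(process_chunk, reader)

    # Reassembling still needs enough memory for the full frame.
    source_data_df = pd.concat(chunks, ignore_index=True)

Is this the right direction, or is there a better way to parallelize the load?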

Big thank you!

Michael

  • Why do you think using multiprocessing will solve the `Out of Memory` error? Whether a single Python process holds the entire dataframe or multiple processes hold chunks of it, the total memory required to hold all the data will be more or less the same. – Shiva Jan 12 '21 at 04:04
  • Please take a look at the `chunksize` option, as in https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas. – antoine Jan 12 '21 at 04:39
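
For reference, a minimal sketch of the `chunksize` approach suggested in the comment above. It streams the file chunk by chunk and reduces each chunk as it goes, so the full frame never has to be resident at once; the `status` column and the filter condition are hypothetical placeholders:

    import pandas as pd

    matched_rows = []
    total_rows = 0

    # Stream the CSV so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(
        'large_input.csv',       # stand-in for temp_file above
        skipinitialspace=True,
        dtype=str,
        na_values=['N.A.'],
        chunksize=50_000,        # arbitrary chunk size
    ):
        total_rows += len(chunk)
        # Hypothetical per-chunk reduction: keep only the rows we need.
        matched_rows.append(chunk[chunk['status'] == 'active'])

    filtered_df = pd.concat(matched_rows, ignore_index=True)
    print(total_rows, len(filtered_df))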
