
I have a file with 200 million records, which I am reading with pandas read_csv using a chunksize of 10000. Each chunk (a dataframe) is converted into a list, and that list is passed to a function.

import sys
import pandas as pd

file_name = str(sys.argv[2])
# read the file lazily, 10000 rows at a time
df = pd.read_csv(file_name, na_filter=False, chunksize=10000)
for data in df:
    d = data.values.tolist()   # convert the chunk to a list of rows
    load_data(d)

Is there any way the load_data function calls can be run in parallel, so that more than 10000 records are processed at the same time?

I tried the solutions mentioned in the questions below:

  1. Python iterating over a list in parallel?
  2. How to run functions in parallel?

But these don't work for me, as I need to convert each dataframe into a list before calling the function.
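
To show what I mean, here is roughly the multiprocessing.Pool pattern those answers point to (a minimal sketch; load_data stands in for my real loading function, and the pool size of 4 is arbitrary):

import sys
from multiprocessing import Pool

import pandas as pd

def load_data(rows):
    ...  # stand-in for the real loading logic

if __name__ == '__main__':
    file_name = str(sys.argv[2])
    chunks = pd.read_csv(file_name, na_filter=False, chunksize=10000)
    with Pool(processes=4) as pool:
        lists = (chunk.values.tolist() for chunk in chunks)
        # iterate the results so the pool actually works through every chunk
        for _ in pool.imap(load_data, lists):
            pass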

Any help will be highly appreciated.


1 Answer


Yes, dask is very good at this. Try:

import dask.dataframe as dd

dx = dd.read_csv(file_name, na_filter=False)

# dask's DataFrame.apply is row-wise and requires axis=1;
# meta describes the return type of my_function
ans_delayed = dx.apply(my_function, axis=1, meta='{the return type}')

ans = ans_delayed.compute()
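
If load_data needs a whole chunk at a time, as in your loop, a variant of the same idea is to turn each partition into a delayed task (a rough sketch; load_chunk is just a helper name I've made up):

import dask
import dask.dataframe as dd

def load_chunk(pdf):
    # each partition arrives as a plain pandas DataFrame
    return load_data(pdf.values.tolist())

dx = dd.read_csv(file_name, na_filter=False)
tasks = [dask.delayed(load_chunk)(part) for part in dx.to_delayed()]
dask.compute(*tasks)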

If you really need the data as a list, you could try

import dask.bag as db
import pandas as pd

generator = pd.read_csv(file_name, na_filter=False, chunksize=10000)

ans = db.from_sequence(generator).map(
    lambda df: load_data(df.values.tolist())).compute()
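
If you want one task per chunk, you could also set partition_size when building the bag and pick the scheduler explicitly (a sketch; the defaults may already be enough):

ans = db.from_sequence(generator, partition_size=1).map(
    lambda df: load_data(df.values.tolist())).compute(scheduler='processes')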