
I have a file with 200 million records, which I am reading with pandas read_csv using a chunksize of 10000. Each chunk (a dataframe) is converted into a list, and that list is passed to a function.

import sys
import pandas as pd

file_name = str(sys.argv[2])
# read the file lazily, 10000 rows at a time
df = pd.read_csv(file_name, na_filter=False, chunksize=10000)
for data in df:
    d = data.values.tolist()   # convert the chunk to a list of rows
    load_data(d)

Is there any way the load_data function calls can be run in parallel, so that more than 10000 records are processed at the same time?

I tried the solutions mentioned in the questions below:

  1. Python iterating over a list in parallel?
  2. How to run functions in parallel?

But these don't work for me, as I need to convert each dataframe into a list before calling the function.
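
To show what I mean, here is roughly the multiprocessing.Pool pattern those answers point to (a minimal sketch; load_data stands in for my real loading function, and the pool size of 4 is arbitrary):

import sys
from multiprocessing import Pool

import pandas as pd

def load_data(rows):
    ...  # stand-in for the real loading logic

if __name__ == '__main__':
    file_name = str(sys.argv[2])
    chunks = pd.read_csv(file_name, na_filter=False, chunksize=10000)
    with Pool(processes=4) as pool:
        lists = (chunk.values.tolist() for chunk in chunks)
        # iterate the results so the pool actually works through every chunk
        for _ in pool.imap(load_data, lists):
            pass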

Any help will be highly appreciated.


1 Answer


Yes, dask is very good at this. Try:

import dask.dataframe as dd

dx = dd.read_csv(file_name, na_filter=False)

# dask's DataFrame.apply is row-wise and requires axis=1;
# meta describes the return type of my_function
ans_delayed = dx.apply(my_function, axis=1, meta='{the return type}')

ans = ans_delayed.compute()
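
If load_data needs a whole chunk at a time, as in your loop, a variant of the same idea is to turn each partition into a delayed task (a rough sketch; load_chunk is just a helper name I've made up):

import dask
import dask.dataframe as dd

def load_chunk(pdf):
    # each partition arrives as a plain pandas DataFrame
    return load_data(pdf.values.tolist())

dx = dd.read_csv(file_name, na_filter=False)
tasks = [dask.delayed(load_chunk)(part) for part in dx.to_delayed()]
dask.compute(*tasks)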

If you really need the data as a list, you could try

import dask.bag as db
import pandas as pd

generator = pd.read_csv(file_name, na_filter=False, chunksize=10000)

ans = db.from_sequence(generator).map(
    lambda df: load_data(df.values.tolist())).compute()
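
If you want one task per chunk, you could also set partition_size when building the bag and pick the scheduler explicitly (a sketch; the defaults may already be enough):

ans = db.from_sequence(generator, partition_size=1).map(
    lambda df: load_data(df.values.tolist())).compute(scheduler='processes')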