I have a dask dataframe created using chunks of a certain `blocksize`:

```
import dask.dataframe as dd

df = dd.read_csv(filepath, blocksize=blocksize * 1024 * 1024)
```
I can process it in chunks like this:
```
from dask import delayed

partial_results = []
for partition in df.partitions:
    partial = trivial_func(partition[var])
    partial_results.append(partial)
result = delayed(sum)(partial_results)
```
(Here I tried using `map_partitions`, but ended up just using a `for` loop instead.) Up to this point, everything works fine.
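For reference, the same per-partition reduction can also be written without the explicit loop by going through `to_delayed()`; this is only a sketch, assuming `trivial_func` accepts a plain pandas Series:

```
from dask import delayed

# df.to_delayed() yields one delayed pandas DataFrame per partition,
# so this mirrors the for loop above.
partial_results = [delayed(trivial_func)(part[var]) for part in df.to_delayed()]
result = delayed(sum)(partial_results)
```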
Now I need to run a different function on the same data, but this one needs to receive a fixed number of rows of the dataframe at a time (e.g. `rows_per_chunk=60`). Is this achievable? With pandas, I would do:
```
import math

partial_results = []
for i in range(math.ceil(len_df / rows_per_chunk)):  # ceil so a final, shorter chunk is not dropped
    arg_data = df.iloc[i * rows_per_chunk:(i + 1) * rows_per_chunk]
    partial = not_so_trivial_func(arg_data)
    partial_results.append(partial)
result = sum(partial_results)
```
Is it possible to do something like this with dask? I know that, because of lazy evaluation, row slicing with `iloc` isn't available, but is it possible to partition the dataframe in a different way, by a fixed number of rows? If not, what would be the most efficient way to achieve this with dask? The dataframe has millions of rows.
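To make the goal concrete, this is the kind of thing I am hoping for (a purely hypothetical sketch: `repartition_by_rows` is not a real dask method, it just illustrates the behaviour I'm after):

```
from dask import delayed

# Hypothetical: produce partitions of exactly rows_per_chunk rows each.
# repartition_by_rows does not exist in dask; it is only illustrative.
df_fixed = df.repartition_by_rows(rows_per_chunk)

partial_results = [delayed(not_so_trivial_func)(part) for part in df_fixed.to_delayed()]
result = delayed(sum)(partial_results)
```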