
I have a dask dataframe created using chunks of a certain blocksize:

import dask.dataframe as dd

df = dd.read_csv(filepath, blocksize=blocksize * 1024 * 1024)  # blocksize given in MiB

I can process it in chunks like this:

from dask import delayed

partial_results = []
for partition in df.partitions:
    partial = trivial_func(partition[var])
    partial_results.append(partial)
result = delayed(sum)(partial_results)  # still lazy; call result.compute() to evaluate

(Here I tried using map_partitions, but ended up just using a for loop instead.) Up to this point, everything works fine.
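
For reference, a minimal sketch of the map_partitions route, assuming trivial_func accepts a pandas Series and returns a single number per partition:

partials = df[var].map_partitions(trivial_func)  # one value per partition
result = partials.sum()  # still lazy; result.compute() evaluates the graph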

Now I need to run a function on the same data, but this function needs to receive a certain number of rows at a time (e.g. rows_per_chunk=60) instead. Is this achievable? With pandas, I would do:

import math

partial_results = []
for i in range(math.ceil(len_df / rows_per_chunk)):  # ceil keeps the final partial chunk
    arg_data = df.iloc[i*rows_per_chunk:(i+1)*rows_per_chunk]
    partial = not_so_trivial_func(arg_data)
    partial_results.append(partial)
result = sum(partial_results)

Is it possible to do something like this with dask? I know that, because of lazy evaluation, it's not possible to use iloc to slice rows like this, but is it possible to partition the dataframe in a different way? If not, what would be the most efficient way to achieve this with dask? The dataframe has millions of rows.

1 Answer

You can repartition the dataframe along divisions, which define how index values are allocated across partitions (assuming a unique, sorted index).

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(range(15), columns=['x'])
ddf = dd.from_pandas(df, npartitions=3)

# there will be 5 rows per partition
print(ddf.map_partitions(len).compute())

# you can see that ddf is split along these index values
print(ddf.divisions)

# change the divisions to have the desired spacing
new_divisions = (0, 3, 6, 9, 12, 14)
new_ddf = ddf.repartition(divisions=new_divisions)

# now there will be 3 rows per partition
print(new_ddf.map_partitions(len).compute())
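
Once the partitions have the desired row counts, the per-chunk computation from the question can run on each partition. A minimal sketch, assuming not_so_trivial_func accepts a pandas DataFrame and returns a single number:

from dask import delayed

# to_delayed() yields one delayed pandas DataFrame per partition
partials = [delayed(not_so_trivial_func)(part) for part in new_ddf.to_delayed()]
result = delayed(sum)(partials).compute()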

If the index is not known, then it's possible to create a new index (assuming that the rows do not require sorting) and repartition along the computed divisions:

import dask.dataframe as dd
import pandas as pd

# save some data into unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)


# read from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)  # tiny blocksize forces multiple partitions

# assume that rows are already ordered (so no sorting is needed);
# the index can then be rebuilt from the partition lengths
cumlens = ddf.map_partitions(len).compute().cumsum()

# since processing will be done on a partition-by-partition basis, shift
# each partition's index by the cumulative length of the preceding partitions
new_partitions = [ddf.partitions[0]]
for npart in range(1, ddf.npartitions):
    partition = ddf.partitions[npart]
    partition.index = partition.index + cumlens[npart - 1]
    new_partitions.append(partition)

# this is our new ddf
ddf = dd.concat(new_partitions)

# set divisions based on cumulative lengths: the first index value of each
# partition, plus the last index value overall (divisions are inclusive)
ddf.divisions = tuple([0] + cumlens.tolist()[:-1] + [cumlens.tolist()[-1] - 1])

# change the divisions to have the desired spacing
new_partition_size = 12
max_index = cumlens.tolist()[-1] - 1  # the last index value in the dataframe
new_divisions = list(range(0, max_index, new_partition_size))
if new_divisions[-1] < max_index:
    new_divisions.append(max_index)
new_ddf = ddf.repartition(divisions=new_divisions)

# now there will be the desired number of rows per partition
print(new_ddf.map_partitions(len).compute())
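
As a quick sanity check (reusing num_rows from the snippet above), the total row count should be unchanged by the reindexing and repartitioning:

assert new_ddf.map_partitions(len).compute().sum() == num_rows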
SultanOrazbayev
  • Hi @SultanOrazbayev, I tried your solution and your code works, but not for my case. The problem is that with the amount of data I have, I can't create a Pandas dataframe first. – 6659081 Feb 04 '21 at 13:55
  • I get `left side of old and new divisions are different`, because `ddf.divisions` returns a tuple of just `None`. – 6659081 Feb 04 '21 at 14:11
  • I see, are you loading from parquet or csv? It looks like your original data is not indexed. – SultanOrazbayev Feb 04 '21 at 15:40
  • I'm loading from CSV, and indeed, my data is not indexed. Does `dask` need indexed data always? – 6659081 Feb 05 '21 at 10:29
  • Index is needed for some operations, not all. Please see the updated example. I hope that works. – SultanOrazbayev Feb 05 '21 at 14:48