First, I'm assuming you mean iterator here, not generator (see Difference between Python's Generators and Iterators).
So: you can use `iterrows` on a dask DataFrame, but you shouldn't, because it'll be very slow and inefficient. (As a side note, `itertuples` is usually preferred over `iterrows` anyway.)
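For instance, in plain pandas (a quick illustration with a made-up single-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})

# itertuples yields lightweight namedtuples instead of building a
# pd.Series for every row, which is why it's usually faster than iterrows.
for row in df.itertuples():
    print(row.Index, row.a)
```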
Dask's `iterrows` method has to compute each partition, then iterate over its rows, which eliminates a lot of the benefits of Dask. But your `gen_pandas` function will behave the same on `ddf` as on `df`. (You also don't need the function at all; `gen_pandas(df)` is exactly the same as just `df.iterrows()`.)
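A minimal sketch of what that looks like (same toy DataFrame as above):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"a": range(10)})
ddf = dd.from_pandas(df, npartitions=3)

# This works, but each partition is computed (pulled into memory)
# before its rows are yielded one at a time.
for idx, row in ddf.iterrows():
    print(idx, row["a"])
```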
> turning each partition into a generator
If you meant that you want to iterate over partitions, not rows, then you can use `ddf.partitions` to access each partition as its own dask DataFrame:
```
>>> list(ddf.partitions)
[Dask DataFrame Structure:
                   a
npartitions=1
0              int64
4                ...
Dask Name: blocks, 4 tasks,
 Dask DataFrame Structure:
                   a
npartitions=1
4              int64
8                ...
Dask Name: blocks, 4 tasks,
 Dask DataFrame Structure:
                   a
npartitions=1
8              int64
9                ...
Dask Name: blocks, 4 tasks]
```
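Each of those is still lazy; to actually work with one, you'd compute it into a pandas DataFrame (a sketch):

```python
# Each entry in ddf.partitions is a lazy, single-partition dask DataFrame;
# .compute() materializes it as a concrete pandas DataFrame.
for part in ddf.partitions:
    pdf = part.compute()
    print(type(pdf), len(pdf))
```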
However, in either case, you should consider not iterating over a dask DataFrame at all. Instead, try mapping your operation using `apply` or `map_partitions`. This allows Dask to run the operation in parallel over every partition at once.
For example, instead of:
```python
# Serial version: rows are downloaded and counted one at a time.
# (download_wikipedia_page and count_num_links are illustrative helpers.)
ids = []
counts = []
for _, row in ddf.iterrows():  # iterrows yields (index, Series) pairs
    page = download_wikipedia_page(row.wikipedia_name)
    link_count = count_num_links(page)
    ids.append(row.id)
    counts.append(link_count)
counts = pd.Series(counts, index=ids)
```
You'd do this with Dask:
```python
def get_wikipedia_page_count(row: pd.Series) -> int:
    page = download_wikipedia_page(row.wikipedia_name)
    return count_num_links(page)

# meta tells Dask the output is a Series of ints
counts = ddf.apply(get_wikipedia_page_count, axis=1, meta=("counts", int))
```
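Note that `counts` is still lazy here; call `counts.compute()` to get a concrete pandas Series. The same idea expressed per-partition with `map_partitions` (a sketch, reusing the same hypothetical helpers):

```python
def count_links_in_partition(pdf: pd.DataFrame) -> pd.Series:
    # pdf is a plain pandas DataFrame: one whole partition at a time.
    return pdf["wikipedia_name"].map(
        lambda name: count_num_links(download_wikipedia_page(name))
    )

counts = ddf.map_partitions(count_links_in_partition, meta=("counts", int))
```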
Iterators are fundamentally serial. Dask is fundamentally parallel. By forcing Dask to operate serially, you lose a lot of the benefits of Dask. To get good performance, you'll need to refactor iterator-based code to map that operation over every partition in parallel.
Alternatively, if the iterator approach makes more sense for your code, you might be interested in the streamz library, which is designed for building streaming pipelines and integrates with Dask.
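A minimal sketch of that style (again assuming the hypothetical helpers above; each emitted element is pushed through the pipeline as it arrives):

```python
from streamz import Stream

source = Stream()
# Build the pipeline: map each incoming row through the counting
# function and print results as they arrive.
source.map(get_wikipedia_page_count).sink(print)

for _, row in df.iterrows():
    source.emit(row)
```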