
I need to process a large file and change some of its values.

I would like to do something like this:

for index, row in dataFrame.iterrows():

    foo = doSomeStuffWith(row)
    lol = doOtherStuffWith(row)

    dataFrame['colx'][index] = foo
    dataFrame['coly'][index] = lol

Unfortunately for me, I cannot do dataFrame['colx'][index] = foo in Dask!

My number of rows is quite large, and I need to process a large number of columns, so I'm afraid that Dask may read the file several times if I do one dataFrame.apply(...) per column.

Other solutions are to manually break my data into chunks and use Pandas, or to just throw everything into a database. But it would be nice if I could keep using my .csv file and let Dask do the chunk processing for me!

Thanks for your help.

Caerbanog
  • Did you check out map_partitions? – Arco Bast Mar 18 '17 at 07:06
  • How big is the file? What format is the file? The main thing to do probably is to really understand when the file would be read to avoid re-reading it unnecessarily. – Mike Graham Mar 20 '17 at 02:28
  • @ArcoBast One reason to use this (`.iterrows`) and not `.apply` is for when there's useful data in the index, and at least according to https://stackoverflow.com/questions/26658240/getting-the-index-of-a-row-in-a-pandas-apply-function, it's hard to access the index directly from an `.apply` – Itamar Mushkin Aug 07 '19 at 09:27

2 Answers


In general, iterating over a dataframe, whether Pandas or Dask, is likely to be quite slow. Additionally, Dask won't support row-wise element insertion. This kind of workload is difficult to scale.

Instead, I recommend using dd.Series.where (see this answer), or else doing your iteration in a function (after making a copy so as not to operate in place) and then using map_partitions to call that function across all of the Pandas dataframes in your Dask dataframe.
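
For illustration, here is a minimal sketch of the map_partitions approach; the per-row functions, column names, and file name are hypothetical stand-ins for the ones in the question:

    import dask.dataframe as dd

    # Hypothetical per-row functions, standing in for the question's
    # doSomeStuffWith / doOtherStuffWith.
    def doSomeStuffWith(row):
        return row['colx'] * 2

    def doOtherStuffWith(row):
        return row['coly'] + 1

    def process(partition):
        # Each partition arrives as a plain Pandas DataFrame; copy it
        # so we don't modify Dask's input in place.
        partition = partition.copy()
        for index, row in partition.iterrows():
            partition.at[index, 'colx'] = doSomeStuffWith(row)
            partition.at[index, 'coly'] = doOtherStuffWith(row)
        return partition

    df = dd.read_csv('data.csv')          # hypothetical file
    result = df.map_partitions(process)
    result.to_csv('processed-*.csv')      # one output file per partition

This way the file is read once, and both columns are updated in a single pass over each partition. The dd.Series.where alternative is worth reaching for when the update can be expressed as a vectorized condition, e.g. df['colx'] = df['colx'].where(df['coly'] > 0, 0).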

MRocklin

You can just use the same syntax as Pandas, although it does evaluate the Dask dataframe as you go along.

for index, row in dask_df.iterrows():
    print(row)
user2589273