
I need to process a large file and change some of its values.

I would like to do something like this:

for index, row in dataFrame.iterrows():

    foo = doSomeStuffWith(row)
    lol = doOtherStuffWith(row)

    dataFrame['colx'][index] = foo
    dataFrame['coly'][index] = lol

Unfortunately for me, I cannot do dataFrame['colx'][index] = foo in Dask!

My number of rows is quite large, and I need to process a large number of columns, so I'm afraid that Dask may read the file several times if I do one dataFrame.apply(...) per column.

Other solutions are to manually break my data into chunks and use Pandas, or to just throw everything into a database. But it would be nice if I could keep using my .csv file and let Dask do the chunk processing for me!

Thanks for your help.

Caerbanog
  • Did you check out map_partitions? – Arco Bast Mar 18 '17 at 07:06
  • How big is the file? What format is the file? The main thing to do probably is to really understand when the file would be read to avoid re-reading it unnecessarily. – Mike Graham Mar 20 '17 at 02:28
  • @ArcoBast One reason to use this (`.iterrows`) and not `.apply` is for when there's useful data in the index, and at least according to https://stackoverflow.com/questions/26658240/getting-the-index-of-a-row-in-a-pandas-apply-function, it's hard to access the index directly from an `.apply` – Itamar Mushkin Aug 07 '19 at 09:27

2 Answers


In general, iterating over a dataframe, whether Pandas or Dask, is likely to be quite slow. Additionally, Dask won't support row-wise element insertion. This kind of workload is difficult to scale.

Instead, I recommend using dd.Series.where (see this answer), or else doing your iteration in a function (after making a copy so as not to operate in place) and then using map_partitions to call that function across all of the Pandas dataframes in your Dask dataframe.
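
For illustration, here is a minimal sketch of the map_partitions approach; the per-row functions, column names, and file name are hypothetical stand-ins for the ones in the question:

    import dask.dataframe as dd

    # Hypothetical per-row functions, standing in for the question's
    # doSomeStuffWith / doOtherStuffWith.
    def doSomeStuffWith(row):
        return row['colx'] * 2

    def doOtherStuffWith(row):
        return row['coly'] + 1

    def process(partition):
        # Each partition arrives as a plain Pandas DataFrame; copy it
        # so we don't modify Dask's input in place.
        partition = partition.copy()
        for index, row in partition.iterrows():
            partition.at[index, 'colx'] = doSomeStuffWith(row)
            partition.at[index, 'coly'] = doOtherStuffWith(row)
        return partition

    df = dd.read_csv('data.csv')          # hypothetical file
    result = df.map_partitions(process)
    result.to_csv('processed-*.csv')      # one output file per partition

This way the file is read once, and both columns are updated in a single pass over each partition. The dd.Series.where alternative is worth reaching for when the update can be expressed as a vectorized condition, e.g. df['colx'] = df['colx'].where(df['coly'] > 0, 0).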

MRocklin

You can just use the same syntax as Pandas, although it does evaluate the Dask dataframe as you go along.

for index, row in dask_df.iterrows():
    print(row)
user2589273