
Is this a valid way of loading subsets of a Dask dataframe into memory:

i = 0
while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    subset = df.loc[i:j, 'source_country_codes'].compute()
    i = j  # advance to the next batch

I read somewhere that this may not be correct because of how Dask assigns index numbers when it divides the larger dataframe into smaller pandas DataFrames. Also, I don't think Dask dataframes have an `iloc` attribute. I am using version 0.15.2.

In terms of use cases, this would be a way of loading batches of data for deep learning (say, Keras).

sachinruk

1 Answer


If your dataset has well-known divisions then this might work, but instead I recommend just computing one partition at a time:

for part in df.to_delayed():
    subset = part.compute()

You can roughly control the size of each subset by repartitioning beforehand:

for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()

This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
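
If you want to see how many rows each partition actually holds (for example, to size training batches), one option is to compute the per-partition lengths first. This is only a sketch, assuming `df` is the Dask dataframe from above:

# Number of rows in each partition; one element per delayed part.
part_sizes = df.map_partitions(len).compute()
print(part_sizes.tolist())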

MRocklin
  • But the latter method will guarantee that I will run through the entire dataset, right? I should also mention that I had done this earlier: `df = dd.from_pandas(df, 16)`. Will that cause a clash, or will it just repartition again? – sachinruk Oct 19 '17 at 03:29
  • Yes, this will include the entire dataset. You can repartition safely. Or you can call from_pandas with a different number of partitions. Everything should work fine here either way. – MRocklin Oct 19 '17 at 11:47
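
Putting the question and the comments together, a minimal end-to-end sketch; the data and column name here are placeholders standing in for the real DataFrame:

import pandas as pd
import dask.dataframe as dd

# Placeholder pandas DataFrame; 'source_country_codes' mirrors the question's column.
pdf = pd.DataFrame({'source_country_codes': range(1000)})

# Partition it up front, as in the question's dd.from_pandas(df, 16) call.
df = dd.from_pandas(pdf, npartitions=16)

# Optionally adjust the rough batch size; per the comment above, repartitioning
# on top of from_pandas is safe.
# df = df.repartition(npartitions=100)

# Each delayed part computes to one pandas DataFrame, so this loop
# covers the whole dataset exactly once, one partition per iteration.
for part in df.to_delayed():
    subset = part.compute()
    # ... feed `subset` to the training loop here ...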