16

Simple question: I have a dask dataframe containing about 300 million records. I need to know the exact number of rows the dataframe contains. Is there an easy way to do this?

When I try to run dataframe.x.count().compute() it looks like it tries to load the entire dataset into RAM, for which there is no space, and it crashes.

martineau
usbToaster
    The use of count() doesn't seem appropriate here. Try len(df) instead. – KRKirov Mar 16 '18 at 01:14
  • Using len(df) also tries to load the entire dataset into memory for some reason. Is that normal? Did I set up my dataframe wrong? All the data is read from a single hdf5 file, by the way. – usbToaster Mar 16 '18 at 09:18
  • Possible duplicate of [How should I get the shape of a dask dataframe?](https://stackoverflow.com/questions/50355598/how-should-i-get-the-shape-of-a-dask-dataframe) – Aziz Alto Dec 07 '18 at 20:44
  • I got the same issue, my solution: https://stackoverflow.com/a/56835689/3784537 – iperetta Jul 01 '19 at 12:29

2 Answers

14
import dask.dataframe as dd

# ensure a small enough block size for the task graph to fit in your memory
ddf = dd.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()

From the documentation:

blocksize <str, int or None> (optional): Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like `64000000` or a string like `"64MB"`. If None, a single block is used for each file.
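
Since the data in the question lives in a single HDF5 file (per the comments), the same idea should apply with read_hdf: keep each partition small so no single chunk has to hold the whole dataset in RAM. A minimal sketch, assuming a file data.h5 with a table stored under the key '/data' written in pandas' table format (both names are placeholders):

import dask.dataframe as dd

# chunksize is the number of rows per partition; keep it small enough
# that one partition comfortably fits in RAM
ddf = dd.read_hdf('data.h5', key='/data', chunksize=1_000_000)

# len() only needs the length of each partition, so the full dataframe
# is never materialised at once
print(len(ddf))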

CodeWarrior
1

If you only need the number of rows,
you can load just a subset of the columns, picking ones with a low memory footprint (such as category or integer columns rather than string/object columns), and thereafter run len(df.index), as in the sketch below.
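
A minimal sketch of that approach, assuming CSV input and an integer column named 'id' (the file pattern and column name are placeholders; extra keyword arguments such as usecols are forwarded to pandas.read_csv):

import dask.dataframe as dd

# read only a single cheap (integer) column instead of the full table
ddf = dd.read_csv('*.csv', usecols=['id'], dtype={'id': 'int64'})

# counting the index only requires per-partition lengths
print(len(ddf.index))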

skibee