16

Simple question: I have a dask dataframe containing about 300 million records. I need to know the exact number of rows the dataframe contains. Is there an easy way to do this?

When I try to run dataframe.x.count().compute() it looks like it tries to load the entire dataset into RAM, for which there is no space, and it crashes.

martineau
usbToaster
    The use of count() doesn't seem appropriate here. Try len(df) instead. – KRKirov Mar 16 '18 at 01:14
  • Using len(df) also tries to load the entire dataset into memory for some reason. Is that normal? Did I set up my dataframe wrong? All the data is read from a single hdf5 file, by the way. – usbToaster Mar 16 '18 at 09:18
  • Possible duplicate of [How should I get the shape of a dask dataframe?](https://stackoverflow.com/questions/50355598/how-should-i-get-the-shape-of-a-dask-dataframe) – Aziz Alto Dec 07 '18 at 20:44
  • I got the same issue, my solution: https://stackoverflow.com/a/56835689/3784537 – iperetta Jul 01 '19 at 12:29

2 Answers

14
import dask.dataframe as dd

# ensure a small enough block size for the task graph to fit in your memory
ddf = dd.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()

From the documentation:

blocksize <str, int or None> (optional): Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like `64000000` or a string like `"64MB"`. If None, a single block is used for each file.
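
Since the data in the question lives in a single HDF5 file (per the comments), the same idea should apply with read_hdf: keep each partition small so no single chunk has to hold the whole dataset in RAM. A minimal sketch, assuming a file data.h5 with a table stored under the key '/data' written in pandas' table format (both names are placeholders):

import dask.dataframe as dd

# chunksize is the number of rows per partition; keep it small enough
# that one partition comfortably fits in RAM
ddf = dd.read_hdf('data.h5', key='/data', chunksize=1_000_000)

# len() only needs the length of each partition, so the full dataframe
# is never materialised at once
print(len(ddf))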

CodeWarrior
1

If you only need the number of rows,
you can load just a subset of the columns, picking ones with a low memory footprint (such as category or integer columns rather than string/object columns), and thereafter run len(df.index), as in the sketch below.
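
A minimal sketch of that approach, assuming CSV input and an integer column named 'id' (the file pattern and column name are placeholders; extra keyword arguments such as usecols are forwarded to pandas.read_csv):

import dask.dataframe as dd

# read only a single cheap (integer) column instead of the full table
ddf = dd.read_csv('*.csv', usecols=['id'], dtype={'id': 'int64'})

# counting the index only requires per-partition lengths
print(len(ddf.index))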

skibee