I have a data frame `df` that contains around 1 GB of data. Why does `df.count()` take a relatively long time to complete, while `df.filter(...)` is much faster? Is there a way to estimate the number of entries in `df` that is faster than `df.count()`?

Dinosaurius
-
Without iterating over the DF, how would you find the count? And what do you mean by `df.filter` being faster? `filter` serves a totally different purpose. – Shankar May 24 '17 at 08:35
-
@Shankar: If I understand your first question correctly, `df.count()` returns the number of rows in `df`. As to your second question, I know that the two commands serve different purposes; I used `filter` only as an example. I just want to know why `df.count()` takes 10-15 minutes to output the result, and whether there is a faster way to calculate the number of rows. – Dinosaurius May 24 '17 at 08:38
-
This might help in understanding why this happens: https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage. Also, I don't think there is a faster way to get the count of the entire RDD. – ar7 May 24 '17 at 08:40
-
@Dinosaurius: To get the count from a DF, `df.count` is the only way. If the data is already available in the file system, you could use a Linux command to find the count instead. – Shankar May 24 '17 at 08:41
-
How is `df` being created? What filter are you applying? If you provide the code, it would be possible to give better guesses. – Assaf Mendelson May 24 '17 at 09:11
-
You should read https://stackoverflow.com/questions/43843470/how-to-know-which-count-query-is-the-fastest – Umberto Griffo May 24 '17 at 10:04
1 Answer
`df.count()` is the correct way.
Note that `df.filter(...)` is a transformation, which means it is lazy: the filtering code isn't executed yet. It will only be executed when you add an action such as `count` or `collect` to the filtered result, and then the runtime should be similar to that of the original call to `count`.
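To make the transformation/action distinction concrete, here is a minimal plain-Python sketch. It is only an analogy, not Spark's implementation: `LazyFrame` and its `rows_scanned` counter are hypothetical names invented for illustration. `filter` merely records a predicate, while `count` is what actually walks the data.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)
        self.rows_scanned = 0  # rows read so far by actions on this object

    def filter(self, predicate):
        # Transformation: only extends the recorded plan, reads no data.
        return LazyFrame(self._rows, self._predicates + (predicate,))

    def count(self):
        # Action: scans every row and applies all recorded predicates.
        matched = 0
        for row in self._rows:
            self.rows_scanned += 1
            if all(p(row) for p in self._predicates):
                matched += 1
        return matched


df = LazyFrame(range(100_000))

evens = df.filter(lambda x: x % 2 == 0)  # returns at once, nothing scanned
scanned_after_filter = evens.rows_scanned

total = df.count()          # full scan of the data
even_total = evens.count()  # also a full scan, plus the predicate
```

In this toy model the filtered count reads exactly as many rows as the plain count, which mirrors the point above: `filter` only looks fast because no work has happened yet.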

Harald Gliebe