I want to check whether my DataFrame is non-empty, i.e. has at least one record. Is there a better approach than calling the count method and checking whether the value is greater than 0?

Anand B

3 Answers


Might as well do this:

df.take(1).length == 0

`rdd.isEmpty` implements the above functionality internally.
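
For illustration, a minimal sketch of both checks, assuming a SparkSession named `spark` is already in scope:

val df = spark.range(10).toDF("id")   // sample DataFrame with ten rows

val nonEmpty = df.take(1).nonEmpty    // true: at least one row exists
val isEmpty  = df.take(1).isEmpty     // false: the same check, inverted

Since take(1) fetches at most one row, either check avoids scanning the whole DataFrame the way count would.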

philantrovert
  • It can be even faster. When we call `.rdd`, Spark cannot optimize the query for some data sources, e.g. JDBC. Your version uses all the optimizations that are possible – T. Gawęda May 17 '17 at 14:32
  • @T.Gawęda Aren't RDDs the underlying representation of everything in Spark? I'm just wondering whether `df.rdd` will take a lot of time if the DataFrame has, say, a million rows – philantrovert May 17 '17 at 14:33
  • 1
    It is used by Datasets, but when you invoke action on Dataset, when Spark tries to optimize query. Also, call to `rdd` deserializes Rows from internal form to normal - see http://stackoverflow.com/questions/43843470/how-to-know-which-count-query-is-the-fastest – T. Gawęda May 17 '17 at 14:39
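
To make the trade-off from the comments concrete, a sketch (a hypothetical comparison; `df` is any DataFrame):

val viaDataset = df.head(1).isEmpty   // stays in the Dataset API, so Catalyst can push the limit down to the source
val viaRdd     = df.rdd.isEmpty       // converts to an RDD first, deserializing rows and possibly skipping source-side optimizations (e.g. JDBC)

Both return the same Boolean; the difference is only in how much work the data source has to do.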

Use rdd.isEmpty:

scala> Seq[(Long, String)]((1L, "a")).toDF.rdd.isEmpty
res0: Boolean = false

scala> Seq[(Long, String)]().toDF.rdd.isEmpty
res1: Boolean = true


Use `isEmpty` of RDD, which is implemented as:

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
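
As a sketch, the same logic can be mirrored at the DataFrame level without going through `.rdd` (a hypothetical helper, not part of the Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: take(1) fetches at most one row, so the check stays cheap.
def dfIsEmpty(df: DataFrame): Boolean = df.take(1).length == 0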