I want to check whether my DataFrame is non-empty, i.e. has at least one record. Is there a better approach than calling the count method and checking whether the value is greater than 0?

Anand B

3 Answers


Might as well do this:

df.take(1).length == 0

`rdd.isEmpty` implements the above functionality internally.
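
For illustration, a minimal sketch of both checks, assuming a SparkSession named `spark` is already in scope:

val df = spark.range(10).toDF("id")   // sample DataFrame with ten rows

val nonEmpty = df.take(1).nonEmpty    // true: at least one row exists
val isEmpty  = df.take(1).isEmpty     // false: the same check, inverted

Since take(1) fetches at most one row, either check avoids scanning the whole DataFrame the way count would.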

philantrovert
  • It can be even faster. When we call `.rdd`, Spark cannot optimize the query for some data sources, e.g. JDBC. Your version uses all the optimizations that are possible – T. Gawęda May 17 '17 at 14:32
  • @T.Gawęda Aren't RDDs the underlying representation of everything in Spark? I'm just wondering whether `df.rdd` will take a lot of time if the DataFrame has, say, a million rows – philantrovert May 17 '17 at 14:33
  • 1
    It is used by Datasets, but when you invoke action on Dataset, when Spark tries to optimize query. Also, call to `rdd` deserializes Rows from internal form to normal - see http://stackoverflow.com/questions/43843470/how-to-know-which-count-query-is-the-fastest – T. Gawęda May 17 '17 at 14:39
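
To make the trade-off from the comments concrete, a sketch (a hypothetical comparison; `df` is any DataFrame):

val viaDataset = df.head(1).isEmpty   // stays in the Dataset API, so Catalyst can push the limit down to the source
val viaRdd     = df.rdd.isEmpty       // converts to an RDD first, deserializing rows and possibly skipping source-side optimizations (e.g. JDBC)

Both return the same Boolean; the difference is only in how much work the data source has to do.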

Use rdd.isEmpty:

scala> Seq[(Long, String)]((1L, "a")).toDF.rdd.isEmpty
res0: Boolean = false

scala> Seq[(Long, String)]().toDF.rdd.isEmpty
res1: Boolean = true


Use `isEmpty` of RDD, which is implemented as:

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
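
As a sketch, the same logic can be mirrored at the DataFrame level without going through `.rdd` (a hypothetical helper, not part of the Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: take(1) fetches at most one row, so the check stays cheap.
def dfIsEmpty(df: DataFrame): Boolean = df.take(1).length == 0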