I want to check whether my DataFrame is non-empty, i.e. has at least one record. Is there a better approach than calling the count method and checking whether the count is greater than 0?
3 Answers
Might as well do this:
df.take(1).length == 0 // true when the DataFrame is empty
`rdd.isEmpty` implements the above functionality internally.
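
For instance, a minimal sketch of this check (the session setup here is hypothetical; in spark-shell a `spark` session already exists):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("emptiness-check").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a")).toDF("id", "value")

// take(1) fetches at most one row instead of counting everything
val isEmpty = df.take(1).length == 0   // false for this DataFrame
val nonEmpty = df.take(1).nonEmpty     // equivalent positive check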

philantrovert
- It can be even faster. When we're calling `.rdd`, Spark is not able to optimize the query for some data sources, e.g. JDBC. Your version uses all the optimizations that are possible. – T. Gawęda May 17 '17 at 14:32
- @T.Gawęda Aren't RDDs the underlying source of everything in Spark? I'm just wondering whether `df.rdd` will take a lot of time if the dataframe has, say, a million rows? – philantrovert May 17 '17 at 14:33
- It is used by Datasets, but when you invoke an action on a Dataset, Spark tries to optimize the query. Also, a call to `rdd` deserializes Rows from the internal format to normal ones - see http://stackoverflow.com/questions/43843470/how-to-know-which-count-query-is-the-fastest – T. Gawęda May 17 '17 at 14:39
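
To illustrate the point from the comments (a minimal sketch, assuming an existing DataFrame `df`): staying in the Dataset API lets Spark optimize the plan, whereas going through `.rdd` first deserializes rows out of Spark's internal format:

// Stays inside the optimized Dataset plan; only one row is requested
val hasRows = df.head(1).nonEmpty

// Also works, but .rdd leaves the optimized plan and deserializes
// internal rows into ordinary Row objects before checking
val hasRowsViaRdd = !df.rdd.isEmpty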
Use `rdd.isEmpty`:
scala> Seq[(Long, String)]((1L, "a")).toDF.rdd.isEmpty
res0: Boolean = false
scala> Seq[(Long, String)]().toDF.rdd.isEmpty
res1: Boolean = true

user8026000
Use RDD's `isEmpty`; its implementation is:
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
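
The `partitions.length == 0` branch matters: an RDD with no partitions at all answers from metadata alone, while an RDD whose partitions merely contain no data still runs a job for `take(1)`. A minimal sketch, assuming a live SparkContext `sc`:

// Zero partitions: isEmpty short-circuits without launching a job
sc.emptyRDD[Int].isEmpty                          // true

// Partitions exist but hold no matching data: take(1) must scan them
sc.parallelize(1 to 10).filter(_ > 100).isEmpty   // true, but runs a job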

Piyush Acharya