if (df.count() == 0) {
    System.out.println("df is an empty dataframe");
}

The above is a way to check whether a DataFrame is empty without getting a NullPointerException.

Is there a better way to do this in Spark? I am worried that if the DataFrame df grows to millions of records, the statement above will take a long time to execute.

user5626966
  • The above code will get a NullPointerException if df is not a valid object; but generally `Object.count()` is an inexpensive call. – Mark Stewart May 23 '17 at 00:24
  • Yes, df is declared and initialised properly as a DataFrame, but in this scenario it may either contain values or be empty/null. – user5626966 May 23 '17 at 00:28

2 Answers


I recently came across one such scenario. The following are some ways to check whether a DataFrame is empty.

  • df.count() == 0
  • df.head(1).isEmpty
  • df.rdd.isEmpty
  • df.take(1).isEmpty

It is generally better to avoid count(), since it is the most expensive option: it requires a full scan of the data. However, there are situations where you can be certain the DataFrame will have either a single row or no records at all (for example, the result of a max() aggregation in a Hive query). In such situations it is fine to use count().
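
A minimal runnable sketch of these checks in Scala; the local SparkSession and the toy empty DataFrame are assumptions made purely for illustration:

import org.apache.spark.sql.SparkSession

object EmptyCheckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A hypothetical empty DataFrame, built only for this demo.
    val df = Seq.empty[(Int, String)].toDF("id", "name")

    // Full scan: counts every row across all partitions.
    println(df.count() == 0)      // true, but expensive on large data

    // These fetch at most one row, so Spark stops as soon as it finds one.
    println(df.head(1).isEmpty)   // true
    println(df.take(1).isEmpty)   // true

    // Converts to an RDD first; also stops after finding a single element.
    println(df.rdd.isEmpty)       // true

    spark.stop()
  }
}

The head(1)/take(1) variants return an Array[Row] of at most one element, so their cost stays roughly constant regardless of how many records df holds.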

Sivaprasanna Sethuraman

Taking the count can be slow. Instead, you can just check whether the first row exists:

df.head(1).isEmpty

Note that head(1) returns an empty array on an empty DataFrame, so the check above does not throw. It is df.head() with no argument (and likewise df.first()) that throws java.util.NoSuchElementException if df is empty, so add exception handling around those calls; see the sketch below.
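
As a sketch (not part of the original answer), a small helper wrapping the idiom; the names isDataFrameEmpty and firstRowOption are hypothetical:

import org.apache.spark.sql.{DataFrame, Row}
import scala.util.Try

object DataFrameChecks {
  // Wraps the head(1) idiom shown above. head(1) yields an empty
  // Array[Row] on an empty DataFrame, so no exception is thrown.
  def isDataFrameEmpty(df: DataFrame): Boolean =
    df.head(1).isEmpty

  // By contrast, head() with no argument throws
  // java.util.NoSuchElementException on an empty DataFrame,
  // so guard it if you actually need the first row.
  def firstRowOption(df: DataFrame): Option[Row] =
    Try(df.head()).toOption
}

Either way, at most one row is fetched, so the cost does not grow with the size of the DataFrame.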

Update: Check out How to check if spark dataframe is empty

Devendra Lattu