163

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?

PS: I want to check if it's empty so that I only save the DataFrame if it's not empty

blackbishop
auxdx

18 Answers

210

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or equivalently: not df.head(1)
len(df.take(1)) == 0  # or equivalently: not df.take(1)

Both df.first() and df.head() will throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so calling head on that Array throws a java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.
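
Since the question only wants to save the DataFrame when it is not empty, here is a minimal PySpark sketch of that pattern using the check above; the DataFrame and output path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).toDF("id")        # stand-in for the real DataFrame

if len(df.head(1)) > 0:                # fetches at most one row
    df.write.mode("overwrite").parquet("/tmp/not_empty_output")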

zero323
hulin003
  • 39
    For those using pyspark. isEmpty is not a thing. Do len(d.head(1)) > 0 instead. – AntiPawn79 Nov 21 '17 at 16:39
  • 7
    why is this better then `df.rdd.isEmpty`? – Dan Ciborowski - MSFT Jan 20 '18 at 03:33
  • 3
    df.head(1).isEmpty is taking huge time is there any other optimized solution for this. – Rakesh Sabbani Feb 20 '19 at 06:26
  • 2
    Hey @Rakesh Sabbani, If `df.head(1)` is taking a large amount of time, it's *probably* because your `df`'s execution plan is doing something complicated that prevents spark from taking shortcuts. For example, if you are just reading from parquet files, `df = spark.read.parquet(...)`, I'm pretty sure spark will only read one file partition. But if your `df` is doing other things like aggregations, you may be inadvertently forcing spark to read and process a large portion, if not all, of your source data. – hulin003 Mar 28 '19 at 19:35
  • 4
    just reporting my experience to AVOID: I was using `df.limit(1).count()` naively. On big datasets it takes much more time than the reported examples by @hulin003 which are almost instantaneous – Vzzarr Dec 05 '19 at 12:27
  • a little remark to this solution: you should avoid using df.head(1).isEmpty OR df.take(1).isEmpty on dataframes with > 100 columns because it can cause org.codehaus.janino.JaninoRuntimeException – jd2050 Jan 28 '20 at 11:50
  • 1
    @hulin003 I'm using df.take(1).isEmpty based on your answer, but it takes a very long time (2 mins) even for a couple of hundred rows. Any help? – user2441441 Nov 05 '20 at 05:37
  • For pyspark, simply use `if df.head(1): ...`. Head returns a list, which might be empty, and empty lists evaluate to `False`. That's the idiomatic way in python. – Francesco Pasa Aug 09 '23 at 07:11
58

I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?

zero323
Justin Pihony
34

I had the same question, and I tested 3 main solutions:

  1. (df != null) && (df.count > 0)
  2. df.head(1).isEmpty as @hulin003 suggests
  3. df.rdd.isEmpty() as @Justin Pihony suggests

Of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DF on my machine, in terms of execution time:

  1. it takes ~9366ms
  2. it takes ~5607ms
  3. it takes ~1921ms

Therefore I think that the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
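
For anyone who wants to reproduce a comparison like this, here is a rough PySpark sketch of a timing loop; the DataFrame is a stand-in, and the absolute numbers will depend heavily on the cluster, caching, and the execution plan behind your DataFrame:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).toDF("id")   # stand-in for the real DataFrame

checks = [
    ("count > 0",     lambda d: d.count() > 0),
    ("head(1)",       lambda d: len(d.head(1)) > 0),
    ("rdd.isEmpty()", lambda d: not d.rdd.isEmpty()),
]
for name, check in checks:
    start = time.time()
    check(df)
    print(f"{name}: {time.time() - start:.3f}s")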

vahlala
aName
  • 9
    out of curiosity... what size DataFrames was this tested with? – aiguofer Jul 06 '20 at 16:59
  • 2
    I've tested 10 million rows... and got the same time as for df.count() or df.rdd.isEmpty() – Glib Martynenko May 31 '22 at 19:23
  • In my use case, I want to test if at least one row contains a particular string. `isEmpty()` is faster the majority of times since it finds at least one row and stops. When checking for a string not found in the df, both `count()` and `isEmpty()` have to scan the whole df and then take the same time. – kael Aug 01 '23 at 11:51
17

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):

type DataFrame = Dataset[Row]
Beryllium
  • 1
    isEmpty is slower than df.head(1).isEmpty – Sandeep540 Oct 23 '19 at 20:30
  • @Sandeep540 Really? Benchmark? Your proposal instantiates at least one row. The Spark implementation just transports a number. head() is using limit() as well, the groupBy() is not really doing anything, it is required to get a RelationalGroupedDataset which in turn provides count(). So that should not be significantly slower. It is probably faster in case of a data set which contains a lot of columns (possibly denormalized nested data). Anyway you have to type less :-) – Beryllium Oct 24 '19 at 11:52
  • Beware: I am using `.option("mode", "DROPMALFORMED")` and `df.isEmpty` returned `false` whereas `df.head(1).isEmpty` returned the correct result of `true` because... all of the rows were malformed (someone upstream changed the schema on me). – Mark Rajcok Apr 08 '22 at 15:53
16

You can take advantage of the head() (or first()) functions to see if the DataFrame has a single row. If so, it is not empty.
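
A minimal PySpark sketch of this idea follows. Note that in the Python API, head() with no argument returns None for an empty DataFrame rather than raising, while the Scala head()/first() throw, as the comment below notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

is_empty = df.head() is None        # first Row, or None if the DataFrame is empty
# equivalently, fetch at most one row and test the returned list:
is_empty = len(df.head(1)) == 0
print(is_empty)                     # False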

A F
Rohan Aletty
  • 10
    if dataframe is empty it throws "java.util.NoSuchElementException: next on empty iterator" ; [Spark 1.3.1] – FelixHo May 26 '16 at 03:53
10

If you do df.count > 0, it takes the counts of all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.

The best way to do this is to perform df.take(1) and check whether the result is empty. Accessing an element of the empty result throws an exception, so it is better to put a try around df.take(1).

The DataFrame throws an error when the result of take(1) is accessed, instead of returning an empty row; the screenshot below highlights the specific code lines where the error is thrown.

[screenshot: the run and the resulting error]
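
A PySpark rendering of this idea, as a sketch: in Python, take(1) returns a list, so indexing into it raises an IndexError rather than the Scala NoSuchElementException when the DataFrame is empty, hence the try/except.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], "id INT")   # placeholder: an empty DataFrame

try:
    df.take(1)[0]                          # raises IndexError when empty
    is_empty = False
except IndexError:
    is_empty = True

print(is_empty)                            # True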

Ram Ghadiyaram
Nandakishore
  • 1
    if you run this on a massive dataframe with millions of records that `count` method is going to take some time. – TheM00s3 Nov 04 '16 at 17:35
  • using df.take(1) when the df is empty results in getting back an empty ROW which cannot be compared with null – LetsPlayYahtzee Mar 16 '17 at 19:45
  • i'm using first() instead of take(1) in a try/catch block and it works – Vasile Surdu Mar 21 '17 at 10:38
  • 1
    @LetsPlayYahtzee I have updated the answer with same run and picture that shows error. take(1) returns Array[Row]. And when Array doesn't have any values, by default it gives ArrayOutOfBounds. So I don't think it gives an empty Row. I would say to observe this and change the vote. – Nandakishore Jul 03 '17 at 22:46
9

PySpark 3.3.0+ / Spark 2.4.0+ (Scala)

df.isEmpty()
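
A minimal usage sketch, assuming PySpark 3.3.0 or later so that the method exists; the DataFrame and output path below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], "id INT")   # placeholder: an empty DataFrame

if not df.isEmpty():
    df.write.mode("overwrite").parquet("/tmp/output")   # only save when non-empty
else:
    print("DataFrame is empty, skipping the write")
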
ZygD
user3370741
7

If you are using PySpark, you could also do:

len(df.head(1)) > 0
Adelholzener
6

For Java users, you can use this on a Dataset:

public boolean isDatasetEmpty(Dataset<Row> ds) {
    boolean isEmpty;
    try {
        isEmpty = ((Row[]) ds.head(1)).length == 0;
    } catch (Exception e) {
        return true;
    }
    return isEmpty;
}

This checks all possible scenarios (empty, null).

Abdennacer Lachiheb
4

In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.

import org.apache.spark.sql.DataFrame

object DataFrameExtensions {
  implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
    new ExtendedDataFrame(dataFrame)

  class ExtendedDataFrame(dataFrame: DataFrame) {
    def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
    def nonEmpty(): Boolean = !isEmpty
  }
}

Other methods can be added here as well. To use the implicit conversion, add import DataFrameExtensions._ in the file where you want to use the extended functionality. Afterwards, the methods can be used directly, like so:

val df: DataFrame = ...
if (df.isEmpty) {
  // Do something
}
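
For what it's worth, a rough PySpark analogue of this convenience can be sketched by attaching a helper to the DataFrame class at runtime; the name is_empty is arbitrary and chosen to avoid clashing with the isEmpty() method that PySpark itself added in 3.3.0:

from pyspark.sql import DataFrame, SparkSession

def _is_empty(self) -> bool:
    return len(self.head(1)) == 0   # fetch at most one row

DataFrame.is_empty = _is_empty      # monkey-patch; a sketch, not an official API

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("id")
print(df.is_empty())                # False
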
Shaido
4

In PySpark, you can also use bool(df.head(1)) to obtain a True or False value.

It returns False if the DataFrame contains no rows.

Bose
1

If you only want to find out whether the DataFrame is empty, then df.isEmpty, df.head(1).isEmpty or df.rdd.isEmpty() should work; these take a limit(1) if you examine them:

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#52L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#60L])
   +- *(2) GlobalLimit 1
      +- Exchange SinglePartition
         +- *(1) LocalLimit 1
            ... // the rest of the plan related to your computation

But if you are doing some other computation that requires a lot of memory and you don't want to cache your DataFrame just to check whether it is empty, then you can use an accumulator:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.util.LongAccumulator

def accumulateRows(acc: LongAccumulator)(df: DataFrame): DataFrame =
  df.map { row => // we map to the same row, counting during this map
    acc.add(1)
    row
  }(RowEncoder(df.schema))

val rowAccumulator = spark.sparkContext.longAccumulator("Row Accumulator")
val countedDF = df.transform(accumulateRows(rowAccumulator))
countedDF.write.saveAsTable(...) // main action
val isEmpty = rowAccumulator.isZero

Note that to see the row count, you should first perform the action. If we change the order of the last 2 lines, isEmpty will be true regardless of the computation.
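
For comparison, a rough PySpark sketch of the same accumulator idea; the DataFrame and output path are placeholders, and going through the RDD loses some DataFrame optimizations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).toDF("id")              # stand-in for the real DataFrame

row_acc = spark.sparkContext.accumulator(0)   # driver-side row counter

def count_row(row):
    row_acc.add(1)                            # count rows as they stream through
    return row

counted = spark.createDataFrame(df.rdd.map(count_row), df.schema)
counted.write.mode("overwrite").parquet("/tmp/counted_output")   # main action
is_empty = row_acc.value == 0                 # only meaningful after the action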

0

I found that in some cases:

>>>print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>

>>>df.take(1).isEmpty
'list' object has no attribute 'isEmpty'

The same happens with "length", or if you replace take() with head().

[Solution] For the issue, we can use:

>>>df.limit(2).count() > 1
False
Shekhar Koirala
0

My case was a bit different and I want to share it with you all. My DataFrame was delivered as empty; however, it contained a single record whose values were all null, so it should have been considered empty even though, technically, it wasn't. Therefore I wrote the code below as a solution for my problem.

My Problem: When I issue df.count() I don't get 0 but one record with null values. If I issue df.rdd.isEmpty() I get False.

The Solution:

from pyspark.sql.functions import col, when

def isDfEmpty(df):
    count = df.count()
    if count == 1:  # when df has only one record
        # treat empty strings as nulls, then drop rows where every column is null
        _df_ = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns]).na.drop('all')
        return _df_.rdd.isEmpty()
    return count == 0  # 0 records -> empty; more than one record -> not empty

isDfEmpty(df) #Replace df with your respective dataframe variable

Note: In my case I got only one record in the empty dataframe. If that is not the case please reconsider the if condition.

Alex Raj Kaliamoorthy
  • Unclear how this answers the actual question, then. Null values and empty strings aren't technically empty dataframes. Have you tried `dropNa` operation before counting? – OneCricketeer Jul 30 '23 at 14:46
-1
df1.take(1).length > 0

The take method returns the array of rows, so if the array size is equal to zero, there are no records in df.

Arya McCarthy
Gopi A
-1

Let's suppose we have the following empty dataframe:

df = spark.sql("show tables").limit(0)

If you are using Spark 2.1 with PySpark, you can check whether this DataFrame is empty with:

df.count() > 0

Or

bool(df.head(1))
-2

You can do it like:

val df = sqlContext.emptyDataFrame
if (df.eq(sqlContext.emptyDataFrame))
  println("empty df")
else
  println("normal df")
Stephen Rauch
sYer Wang
  • 1
    won't it require the `schema` of two dataframes (`sqlContext.emptyDataFrame` & `df`) to be same in order to ever return `true`? – y2k-shubham Jan 22 '18 at 13:59
  • 1
    This won't work. `eq` is inherited from `AnyRef` and _tests whether the argument (that) is a reference to the receiver object (this)._ – Alper t. Turker Jan 30 '18 at 01:32
-2

dataframe.limit(1).count > 0

This also triggers a job, but since we are selecting only a single record, even with billions of records the time consumption could be much lower.

From: https://medium.com/checking-emptiness-in-distributed-objects/count-vs-isempty-surprised-to-see-the-impact-fa70c0246ee0

Jordan Morris