
I created a Spark Dataset[Long]:

scala> val ds = spark.range(100000000)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]

When I ran ds.count, it gave me the result in 0.2s (on a 4-core, 8 GB machine). The DAG it created is as follows:

[DAG visualization for ds.count, showing two stages]

But when I ran ds.rdd.count, it gave me the result in 4s (same machine), and the DAG it created is as follows:

[DAG visualization for ds.rdd.count, showing a single stage]

So, my questions are:

  1. Why is ds.rdd.count creating only one stage, whereas ds.count creates 2 stages?
  2. Also, if ds.rdd.count has only one stage, why is it slower than ds.count, which has 2 stages?

  • AFAIK `ds.rdd.count` is very slow because the entire dataset is evaluated (i.e. all columns of all rows), although this is not really necessary just to get the number of rows. The Dataset/DataFrame API optimises this query considerably (see also http://stackoverflow.com/questions/42714291/how-to-force-dataframe-evaluation-in-spark) – Raphael Roth May 13 '17 at 20:11

2 Answers


Why is ds.rdd.count creating only one stage, whereas ds.count creates 2 stages?

Both counts are effectively two-step operations. The difference is that in the case of ds.count, the final aggregation is performed by one of the executors, while ds.rdd.count aggregates the final result on the driver, so that step is not reflected in the DAG.
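A minimal sketch of where the final sum happens in each case, assuming the ds from the question (an illustration only, not the actual Spark source):

// ds.rdd.count-style: every task reports its partition size back to the
// driver, and the driver itself performs the final sum. That sum is not a
// Spark job step, so it never appears in the DAG.
val partitionSizes = ds.rdd
  .mapPartitions(iter => Iterator(iter.size.toLong))
  .collect()                              // per-partition sizes shipped to the driver
val driverSideCount = partitionSizes.sum  // final aggregation on the driver

// ds.count-style: a partial count per partition is shuffled to a single
// reducer, so the final aggregation runs on an executor and shows up as a
// second stage in the DAG.
val executorSideCount = ds.groupBy().count().first().getLong(0)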

Also, if ds.rdd.count has only one stage, why is it slower?

Ditto. Moreover, ds.rdd.count has to initialize (and later garbage collect) 100 million Row objects, which is hardly free and probably accounts for the majority of the time difference here.

Finally, range-like objects are not a good benchmarking tool unless used with a lot of caution. Depending on the context, count over a range can be expressed as a constant-time operation, and even without explicit optimizations it can be extremely fast (see for example spark.sparkContext.range(0, 100000000).count), but it doesn't reflect performance with a real workload.
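For a rough comparison in the same spark-shell session (a sketch only; absolute timings vary by machine, the point is the relative difference):

// spark.time measures and prints the wall-clock time of the enclosed block.
spark.time(ds.count())                                       // optimized Dataset count
spark.time(ds.rdd.count())                                   // deserializes every row into a JVM object first
spark.time(spark.sparkContext.range(0L, 100000000L).count()) // plain RDD of longs, no row deserialization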

Related to: How to know which count query is the fastest?

  • Thanks for the response! It is a nice explanation of my question. And thanks for the suggestion about `spark.sparkContext.range(0, 100000000).count` too; it actually works. But my real issue was creating nested JSON files with Spark SQL. With the Spark Dataset API I can do it fast, but with RDDs it's a real pain! – himanshuIIITian May 13 '17 at 15:02
  • Related question: is my understanding correct that rdd.count will run a distributed job, whereas ds.count will bring all the data to the driver? To answer my own question: it's not true :) – Konstantin Kulagin Jan 15 '18 at 22:07

New in Spark 3.3.0:

pyspark.sql.DataFrame.isEmpty

Returns True if this DataFrame is empty.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isEmpty.html
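
For reference, the Scala API used elsewhere on this page has the same check (Dataset.isEmpty, available since Spark 2.4); a minimal sketch, assuming the spark-shell session from the question:

// isEmpty does not count the whole Dataset; it only checks whether at least one row exists.
spark.range(0).isEmpty   // true
ds.isEmpty               // false for the 100-million-row Dataset from the question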
