I created a Spark Dataset[Long]:
scala> val ds = spark.range(100000000)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
When I ran ds.count, it gave me the result in 0.2s (on a 4-core, 8 GB machine). Also, the DAG it created is as follows:
But when I ran ds.rdd.count, it gave me the result in 4s (on the same machine). However, the DAG it created is as follows:
So, my doubts are:
- Why does ds.rdd.count create only one stage, whereas ds.count creates 2 stages?
- Also, if ds.rdd.count has only one stage, why is it slower than ds.count, which has 2 stages?
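
For reference, here is a minimal sketch of how the two runs can be reproduced and timed in spark-shell (I'm assuming spark.time here, which needs Spark 2.1+; wrapping the calls with System.nanoTime would work just as well):

// spark-shell already provides `spark` as a ready-made SparkSession
val ds = spark.range(100000000)

// Dataset count (the run that shows up as 2 stages in the UI)
spark.time(ds.count())

// RDD count (the run that shows up as a single stage)
spark.time(ds.rdd.count())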