You are of course right. The quote is simply incorrect: filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm:
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
new org.apache.spark.HashPartitioner(11)
)
rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
val filteredRDD = rdd.filter(_._1 == 2)  // keys are 0, 1 or 2
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
rdd.partitioner == filteredRDD.partitioner
// Boolean = true
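This matters in practice because a subsequent key-based operation that ends up with the same partitioner needs no shuffle. A quick way to see this in spark-shell is via toDebugString (the exact lineage output varies by Spark version):

filteredRDD.groupByKey().toDebugString
// The printed lineage contains only MapPartitionsRDDs for this step;
// no ShuffledRDD appears, since groupByKey reuses the existing HashPartitioner.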
This is in contrast to operations like map, which don't preserve partitioning, because they can change the keys (so the partitioner is dropped):
rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
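For comparison, transformations that cannot change the keys, like mapValues, do keep the partitioner, and mapPartitions retains it if you explicitly promise not to modify keys via preservesPartitioning:

rdd.mapValues(_ => 1).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

rdd.mapPartitions(_.map(identity), preservesPartitioning = true).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)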
Datasets are a bit more subtle, as filters are normally pushed down, but overall the behavior is similar.
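A minimal sketch of what that looks like, assuming a Parquet source at a hypothetical path with a column named key:

// Hypothetical path and column name, for illustration only.
val ds = spark.read.parquet("/path/to/data")

// With a columnar source like Parquet, the predicate is normally pushed
// down to the scan; explain() typically lists it under PushedFilters
// rather than as a separate post-scan Filter over partitioned data.
ds.filter(ds("key") === 1).explain()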