
I have a Dataset.

I bucketed a feature column using the org.apache.spark.ml.feature.QuantileDiscretizer class from Spark 2.3.1, but the resulting buckets are not uniform.

The last bucket holds almost twice as many rows as the others, and although I set 11 buckets in the parameters, I actually got only 10.

Here is the program:

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.ml.feature.Bucketizer
val model = new QuantileDiscretizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setNumBuckets(11)
    .setHandleInvalid("keep")
    .fit(df)
println(model.getSplits.mkString(", "))
model
    .transform(df)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

-Infinity, 115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0, Infinity
+-----+------+                                                                  
|level| count|
+-----+------+
| null|  9113|
|  0.0| 55477|
|  1.0| 52725|
|  2.0| 54657|
|  3.0| 53592|
|  4.0| 54165|
|  5.0| 54732|
|  6.0| 52915|
|  7.0| 54090|
|  8.0| 53393|
|  9.0|107369|
+-----+------+

Bucketing the data of the last bucket separately:

val df1 = df.where("features >= 4305.0")
val model1 = new QuantileDiscretizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setNumBuckets(2)
    .setHandleInvalid("keep")
    .fit(df1)

println(model1.getSplits.mkString(", "))
model1
    .transform(df1)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

-Infinity, 20546.12, Infinity
+-----+-----+                                                                   
|level|count|
+-----+-----+
|  0.0|53832|
|  1.0|53537|
+-----+-----+

If I manually specify the split boundaries with a Bucketizer:

val splits = Array(Double.NegativeInfinity, 
    115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0, 
    20546.12, Double.PositiveInfinity)
val model = new Bucketizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setHandleInvalid("keep")
    .setSplits(splits)
model
    .transform(df)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

+-----+-----+                                                                   
|level|count|
+-----+-----+
| null| 9113|
|  0.0|55477|
|  1.0|52725|
|  2.0|54657|
|  3.0|53592|
|  4.0|54165|
|  5.0|54732|
|  6.0|52915|
|  7.0|54090|
|  8.0|53393|
|  9.0|53832|
| 10.0|53537|
+-----+-----+

Why does QuantileDiscretizer behave like this?

And what if I want to split the raw data into even buckets?

desertnaut
xuejianbest

1 Answer


Set the relative error to a small number, e.g. (PySpark shown):

qds = QuantileDiscretizer(
    numBuckets=10, 
    inputCol="score_rand",
    outputCol="buckets", 
    relativeError=0.0001, 
    handleInvalid="error")

If you still do not get nearly even buckets, it is most likely because there are ties in the column being bucketed: tied values cannot be split across bucket boundaries. In that case, try adding a small random number (larger than the relative error) to each value, and you should get the desired number of buckets.
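For the Scala API used in the question, the same fix might look roughly like this. This is a sketch, not the asker's actual data: the synthetic DataFrame, column names, seed, and noise scale are all illustrative, and the jitter magnitude would need to be tuned to the real data's value gaps.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder.master("local[*]").appName("even-buckets").getOrCreate()
import spark.implicits._

// Synthetic stand-in for the question's df: 100k rows with heavy ties.
val df = (1 to 100000).map(i => (i % 1000).toDouble).toDF("features")

// Jitter breaks the ties; keep the noise larger than the relative error
// but small compared to the gaps between distinct feature values.
val jittered = df.withColumn("features_rand", $"features" + rand(42) * 0.001)

val model = new QuantileDiscretizer()
  .setInputCol("features_rand")
  .setOutputCol("level")
  .setNumBuckets(11)
  .setRelativeError(0.0001) // much smaller than the 0.001 default
  .setHandleInvalid("keep")
  .fit(jittered)

val counts = model.transform(jittered).groupBy("level").count().orderBy("level").collect()
println(counts.map(r => s"${r.getDouble(0)} -> ${r.getLong(1)}").mkString("\n"))
```

With distinct (jittered) values and a tight relative error, the computed splits are distinct, so all 11 requested buckets should actually appear.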

hjerp