I'm trying to get the majority vote of a few different models for a binary classification problem.
I managed to compile a Spark table from a few different Spark tables using:
LR.createOrReplaceTempView("lr")
RF.createOrReplaceTempView("rf")
DT.createOrReplaceTempView("dt")
GBT.createOrReplaceTempView("gbt")
majority = spark.sql("SELECT lr.label, lr, rf, dt, gbt FROM lr, rf, dt, gbt")
The output of majority looks like this:
+-----+---+---+---+---+
|label| lr| rf| dt|gbt|
+-----+---+---+---+---+
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
| 0.0|0.0|0.0|0.0|0.0|
+-----+---+---+---+---+
I'm trying to create a column that takes the majority vote (mode) from those four columns. I've looked into this post, but couldn't exactly get what I want.
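To make the target concrete, this is roughly the per-row rule I'm after. It is only a sketch, not something I've run on the real data. The names majority_vote, add_majority_column, tie_value and the output column "majority" are made up for illustration, and I'm assuming every prediction is either 0.0 or 1.0. With four models a 2-2 split is possible, so the sketch returns a placeholder (None in Python, null in Spark) on a tie unless a tie value is supplied.

```python
from collections import Counter


def majority_vote(votes, tie_value=None):
    """Return the most common value in a non-empty list of votes.

    If the top two values are tied (e.g. a 2-2 split), return tie_value.
    """
    counts = Counter(votes).most_common()  # sorted by count, highest first
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value
    return counts[0][0]


def add_majority_column(df, cols=("lr", "rf", "dt", "gbt"), tie_value=None):
    """Apply the same rule as a Spark column expression.

    Assumes each column in cols holds 0.0/1.0 predictions.
    """
    # Imported lazily so the plain-Python helper above works without pyspark.
    from pyspark.sql import functions as F

    ones = sum(F.col(c) for c in cols)  # number of models voting 1.0
    half = len(cols) / 2
    return df.withColumn(
        "majority",
        F.when(ones > half, F.lit(1.0))
        .when(ones < half, F.lit(0.0))
        .otherwise(F.lit(tie_value)),  # tie: null unless tie_value is given
    )
```

For example, majority_vote([1.0, 1.0, 0.0, 1.0]) gives 1.0 and majority_vote([1.0, 1.0, 0.0, 0.0]) gives None. The Spark version would be called as add_majority_column(majority), which is the kind of result I'd like to end up with.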
Thanks so much for helping!