I have a DataFrame built from the following sample data:
private val sample = Seq(
  (1, "A B C D E"),
  (1, "B C D"),
  (1, "B C D E"),
  (1, "B C D F"),
  (1, "A B C"),
  (1, "B C E F G")
)
I want to remove the least-used words from the DataFrame. To find them, I computed TF-IDF:
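import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Tokenizer: lower-cases the input column and splits it on whitespace
val tokenizer = new Tokenizer()
  .setInputCol("regexTransformedColumn")
  .setOutputCol("words")
// TF: hashes each word to an index and counts occurrences
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("rawFeatures")
// IDF: rescales the raw term frequencies
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
// Create the pipeline
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, idf))
// Fit on the input DataFrame and transform the same data
val lrModel = pipeline.fit(regexTransformedLabel)
val lrOutput = lrModel.transform(regexTransformedLabel)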
The output looks like this:
+---------+---------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
|clusterId|words |rawFeatures |features |
+---------+---------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
|1 |[a, b, c, d, e]|(262144,[17222,27526,28698,30913,227410],[1.0,1.0,1.0,1.0,1.0])|(262144,[17222,27526,28698,30913,227410],[0.5596157879354227,0.3364722366212129,0.0,0.0,0.8472978603872037])|
|1 |[b, c, d] |(262144,[27526,28698,30913],[1.0,1.0,1.0]) |(262144,[27526,28698,30913],[0.3364722366212129,0.0,0.0]) |
|1 |[b, c, d, e] |(262144,[17222,27526,28698,30913],[1.0,1.0,1.0,1.0]) |(262144,[17222,27526,28698,30913],[0.5596157879354227,0.3364722366212129,0.0,0.0]) |
|1 |[b, c, d, f] |(262144,[24152,27526,28698,30913],[1.0,1.0,1.0,1.0]) |(262144,[24152,27526,28698,30913],[0.8472978603872037,0.3364722366212129,0.0,0.0]) |
|1 |[a, b, c] |(262144,[28698,30913,227410],[1.0,1.0,1.0]) |(262144,[28698,30913,227410],[0.0,0.0,0.8472978603872037]) |
|1 |[b, c, e, f, g]|(262144,[17222,24152,28698,30913,51505],[1.0,1.0,1.0,1.0,1.0]) |(262144,[17222,24152,28698,30913,51505],[0.5596157879354227,0.8472978603872037,0.0,0.0,1.252762968495368]) |
+---------+---------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
How can I get the words back from the transformed features so that I can remove the least-used ones?
I want to pass a "max features" threshold and remove every word whose TF-IDF value is greater than it. For example, with a threshold of 0.6, A (about 0.85) and G (about 1.25) should be removed from the DataFrame. However, I cannot convert the feature indices back to words, so I cannot tell which words to remove.
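To make the intended result concrete, here is a minimal sketch in plain Scala, without Spark. It assumes the weights come from the smoothed formula ln((N + 1) / (df + 1)) that Spark ML documents for IDF, which reproduces the values in the table above. Because every word occurs at most once per row, the TF-IDF value equals the IDF value here. The names `RareWordFilter`, `docs`, `docFreq`, and `removeRare` are illustrative only and are not part of any Spark API.

```scala
object RareWordFilter {
  // Same six documents as in the sample data.
  val docs: Seq[String] = Seq(
    "A B C D E", "B C D", "B C D E", "B C D F", "A B C", "B C E F G"
  )

  // Lower-case and split on whitespace, mirroring what Tokenizer does.
  val tokens: Seq[Seq[String]] = docs.map(_.toLowerCase.split("\\s+").toSeq)

  // Document frequency: number of documents that contain each word.
  val docFreq: Map[String, Int] =
    tokens.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Smoothed IDF, ln((N + 1) / (df + 1)); assumed to match Spark ML's IDF.
  val idf: Map[String, Double] =
    docFreq.map { case (w, df) => w -> math.log((docs.size + 1.0) / (df + 1.0)) }

  // Keep only the words whose weight does not exceed the threshold.
  def removeRare(maxFeatures: Double): Seq[Seq[String]] =
    tokens.map(_.filter(w => idf(w) <= maxFeatures))

  def main(args: Array[String]): Unit = {
    // Print the word -> weight mapping that the hashed features do not expose.
    idf.toSeq.sortBy(_._1).foreach { case (w, v) => println(f"$w%s -> $v%.4f") }
    // Print each document after dropping the rare words.
    removeRare(0.6).foreach(ws => println(ws.mkString(" ")))
  }
}
```

By this arithmetic F occurs in as few rows as A and gets the same weight, so a threshold of 0.6 would drop F along with A and G. Inside Spark, the difficulty is that HashingTF stores only hashed indices. CountVectorizer, by contrast, keeps its vocabulary in the fitted model, so swapping it in may be one way to obtain the index-to-word mapping.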