I was hoping to use StringIndexer to rank the 1000+ categories in my data set, generating an index that reflects relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores metadata flagging the index as categorical, which forces my model to treat the index as a categorical variable.

Is there some way of disabling this, so the index variable can be used as a numeric variable?

Edit: I am using StringIndexer as a stage in an ML pipeline, so a solution would need to avoid manipulating the DataFrame directly. Also, I will be saving and loading this pipeline, so a custom transformer may be impractical. I suspect this isn't possible as Spark is currently written.

Community

1 Answer

You can index the data and then replace the metadata. Let's say your data looks like this:

import spark.implicits._
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("raw").setOutputCol("indexed")

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = indexer.fit(df).transform(df)
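
To see the categorical flag the question refers to, you can inspect the output column's ML attribute. This is only a sketch: the local SparkSession settings and the name attrBefore are assumptions made for illustration (in spark-shell, getOrCreate simply returns the existing session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.StringIndexer

// Assumed local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("indexer-metadata").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = new StringIndexer()
  .setInputCol("raw")
  .setOutputCol("indexed")
  .fit(df)
  .transform(df)

// Decode the ML attribute stored in the column's schema metadata.
val attrBefore = Attribute.fromStructField(indexed.schema("indexed"))

// StringIndexer marks its output as nominal (categorical).
println(attrBefore.isNominal)

// Raw metadata as JSON, including the label values.
println(indexed.schema("indexed").metadata.json)
```

With the default ordering, the most frequent label receives index 0.0, so the index does encode a frequency rank even though it is tagged as nominal.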

We'll need a NumericAttribute:

import org.apache.spark.ml.attribute.NumericAttribute

and metadata:

val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata

Finally, we can replace the metadata using the as method:

indexed.withColumn("indexed", $"indexed".as("indexed", meta))
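
The steps above can be wrapped in a small helper and checked end to end. This is a sketch under stated assumptions: withNumericMetadata is a hypothetical name, not part of Spark, and the local session settings are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.ml.feature.StringIndexer

// Assumed local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("numeric-metadata").getOrCreate()
import spark.implicits._

// Hypothetical helper: re-alias a column under the same name,
// attaching numeric ML metadata in place of the nominal metadata.
def withNumericMetadata(data: DataFrame, colName: String): DataFrame = {
  val meta = NumericAttribute.defaultAttr.withName(colName).toMetadata()
  data.withColumn(colName, col(colName).as(colName, meta))
}

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = new StringIndexer()
  .setInputCol("raw")
  .setOutputCol("indexed")
  .fit(df)
  .transform(df)

val numeric = withNumericMetadata(indexed, "indexed")

// The column values are unchanged; only the attribute type differs.
val attrAfter = Attribute.fromStructField(numeric.schema("indexed"))
println(attrAfter.isNumeric)
```

Note that this still operates on the DataFrame directly, so it runs after the indexing step rather than as a saved pipeline stage.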
zero323