I was hoping to use StringIndexer as a means of ranking the 1000+ categories in my data set, generating an index which signifies relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores some metadata flagging the index as categorical, forcing my model to treat the index as a categorical variable.
Is there some way of disabling this, so the index variable can be used as a numeric variable?
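To make the ranking I have in mind concrete, here is a minimal plain-Python sketch of it. It is not Spark code: the `frequency_index` helper and the sample labels are made up for illustration, and the alphabetical tie-break is an assumption rather than something I have verified against StringIndexer.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a rank; the most frequent label gets 0.0."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(rank) for rank, label in enumerate(ordered)}


# Hypothetical sample column of category labels.
categories = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_index(categories)
indexed = [mapping[c] for c in categories]

print(mapping)
print(indexed)
```

The resulting values are plain floats with no categorical flag attached, which is how I would like the model to see the column.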
Edit: I am using StringIndexer as a stage in an ML pipeline, so a solution would need to avoid manipulating the data frame directly. Also, I will be saving and loading this pipeline, so a custom transformer may be impractical. I suspect this isn't possible as Spark is currently written.