I'm fitting a GLM (via ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

Suppose my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
# stages must be a list of pipeline stages
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
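To illustrate what that produces (with the encoder's default dropLast=True, the 8 categories become a 7-slot one-hot block, so each indep_vars vector has 8 entries: the continuous value plus 7 dummies):

# Peek at the assembled feature vectors
df.select('indep_vars').show(5, truncate=False)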
Everything works fine to this point, and I run the model:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link them back to the original column names, which I need to do (I've simplified this model for SO; there's more involved).

The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way to recover it:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small dataframe relating the string names to the index values, which I think I could then relate back to the slots in the sparse vector, roughly as in the sketch below. This is very clunky and slow, though, when you consider the scale of the data.
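For concreteness, here's a rough sketch of the stitching I have in mind (assuming the encoder's default dropLast=True, so category index i lands in feature slot i + 1, after the continuous column in slot 0):

# Collect the small string-to-index mapping (this is the slow part)
mapping = {int(row['categorical_index']): row['categorical']
           for row in df[['categorical', 'categorical_index']].distinct().collect()}

coefs = model.coefficients  # slot 0: continuous; slots 1+: one-hot dummies
print('continuous:', coefs[0])
for idx, name in sorted(mapping.items()):
    if idx + 1 < len(coefs):  # the dropped last level has no coefficient
        print(name, coefs[idx + 1])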
Is there a better way to do this?