
I'm running a model using GLM (Spark ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

Say my columns are named continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case) 8 different categories:

string_indexer = StringIndexer(inputCol='categorical', 
                               outputCol='categorical_index')

encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')

assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')

pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)

Everything works fine up to this point, and I fit the model:

glm = GeneralizedLinearRegression(family='gaussian', 
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients

Which outputs:

DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])

Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients back to the original column names, which I need to do (I've simplified this model for SO; there's more involved).

The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:

df[['categorical', 'categorical_index']].distinct()

Which gives me a small dataframe relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow, though, when you consider the scale of the data.
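
For reference, here's roughly what that clunky approach looks like end to end (a sketch, assuming OneHotEncoder's default dropLast=True so the last category becomes the reference level, and that 'continuous' sits first in the assembled vector):

index_map = df.select('categorical', 'categorical_index').distinct().collect()

coefs = model.coefficients  # order: continuous first, then one dummy per non-reference category
names = ['continuous'] + [None] * (len(coefs) - 1)
for row in index_map:
    k = int(row['categorical_index'])
    if k + 1 < len(coefs):  # the highest index is the dropped reference category
        names[k + 1] = row['categorical']

print(list(zip(names, coefs)))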

Is there a better way to do this?

Jeff

4 Answers


For PySpark, here is a solution for mapping feature indices to feature names:

First, train your model:

pipeline = Pipeline().setStages([label_stringIdx, assembler, classifier])
model = pipeline.fit(x)

Transform your data:

df_output = model.transform(x)

Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.

numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')

merge_list = numeric_metadata + binary_metadata 

OUTPUT:

[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
  ....
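
A possible follow-up (assuming the final pipeline stage is a linear model that exposes coefficients, e.g. LogisticRegression; tree models expose featureImportances instead), matching coefficients to names by position:

fitted_classifier = model.stages[-1]  # the fitted classifier from the PipelineModel
name_by_idx = {attr['idx']: attr['name'] for attr in merge_list}

for idx, coef in enumerate(fitted_classifier.coefficients):
    print(name_by_idx.get(idx, '<unknown>'), coef)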
pierre_comalada

I also came across the exact same problem, and here's a solution :)

This is based on the Scala version here: How to map variable names to features after pipeline

# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)

# extract features metadata
meta = [f.metadata 
    for f in best_pred.schema.fields 
    if f.name == 'features'][0]

# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
    meta['ml_attr']['attrs']['binary']

print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
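
One caveat worth checking against your own data: if the assembled vector contains no numeric (or no binary) features, that key is absent from the metadata and the line above raises a KeyError. A defensive variant:

attrs = meta['ml_attr']['attrs']
features_name_ind = sorted(
    attrs.get('numeric', []) + attrs.get('binary', []),
    key=lambda a: a['idx'])  # sort by vector position so it lines up with the coefficients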
YardenR

I didn't investigate the previous versions, but in Spark 2.4.3 it is possible to retrieve a lot of information about the features just by using the summary attribute of a GeneralizedLinearRegressionModel.

Printing the summary gives something like this:

Coefficients:
            Feature Estimate Std Error T Value P Value
        (Intercept)  -0.1742    0.4298 -0.4053  0.6853
  x1_enc_(-inf,5.5]  -0.7781    0.3661 -2.1256  0.0335
   x1_enc_(5.5,8.5]   0.1850    0.3736  0.4953  0.6204
   x1_enc_(8.5,9.5]  -0.3937    0.4324 -0.9106  0.3625
 x45_enc_1-10-7-8-9  -0.5382    0.2718 -1.9801  0.0477
   x45_enc_2-3-4-ND   0.5187    0.2811  1.8454  0.0650
          x45_enc_5  -0.0456    0.3353 -0.1361  0.8917
          x33_enc_1   0.6361    0.4043  1.5731  0.1157
         x33_enc_10   0.0059    0.4083  0.0145  0.9884
 x33_enc_2-3-4-8-ND   0.6121    0.1741  3.5152  0.0004
x102_enc_(-inf,4.5]   0.5315    0.1695  3.1354  0.0017

(Dispersion parameter for binomial family taken to be 1.0000)
    Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846

The Feature column can be constructed by accessing an internal Java object:

In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
 'x1_enc_(5.5,8.5]',
 'x1_enc_(8.5,9.5]',
 'x45_enc_1-10-7-8-9',
 'x45_enc_2-3-4-ND',
 'x45_enc_5',
 'x33_enc_1',
 'x33_enc_10',
 'x33_enc_2-3-4-8-ND',
 'x102_enc_(-inf,4.5]']

The Estimate column can be constructed by the following concatenation:

In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
 -0.7781490190325139,
 0.1850214800764976,
 -0.3936963366945294,
 -0.5382255101657534,
 0.5187453074755956,
 -0.045649677050663987,
 0.6360647167539958,
 0.00593020879299306,
 0.6121475986933201,
 0.531510974697773]
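
Putting the two together (a small sketch; glm is the fitted GeneralizedLinearRegressionModel from above):

names = ['(Intercept)'] + glm.summary._call_java('featureNames')
estimates = [glm.intercept] + list(glm.coefficients)

for name, est in zip(names, estimates):
    print('%20s %10.4f' % (name, est))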

PS: This line shows why the Feature column can be retrieved by using an internal Java object.

boechat107

Sorry, this is a very late answer and you may well have already figured it out, but here it is anyway. I recently implemented the same StringIndexer, OneHotEncoder, and VectorAssembler combination, and as far as I understand, the following code gives you what you're looking for.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")

    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")

    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]

# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a pipeline for training
pipeline = Pipeline(stages=stages)

# Run the feature transformations.
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
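
To then tie each slot of the assembled 'features' vector back to a name (the question raised in the comment below), one rough option is to rebuild the column order from the fitted stages; a sketch assuming OneHotEncoder's default dropLast=True:

from pyspark.ml.feature import StringIndexerModel

# Rebuild feature names in the order VectorAssembler produced them.
# Each fitted StringIndexerModel exposes .labels; with dropLast=True the
# last label is the reference level and gets no dummy column.
feature_names = []
for stage in pipelineModel.stages:
    if isinstance(stage, StringIndexerModel) and stage.getOutputCol() != "label":
        col = stage.getInputCol()
        feature_names += ["%s_%s" % (col, label) for label in stage.labels[:-1]]
feature_names += numericCols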
Terry Mafura
  • Well, the thing is: when you see an entry in the 'features' column in the df (the one in the last row), how are you going to tie it back to the original feature name? – mathopt May 22 '17 at 01:08