I have a PySpark dataframe that looks like this:

+---+-----+---+---+------+
|  a|    b|  c|  d|target|
+---+-----+---+---+------+
|  0|  one|0.0|  0|     0|
|  1|  two|0.1| -1|     1|
|  2|three|0.2| -2|     1|
|  3| four|0.3| -3|     0|
|  4| five|0.4| -4|     1|
+---+-----+---+---+------+
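
For reproducibility, it can be built like this (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 'one', 0.0, 0, 0),
     (1, 'two', 0.1, -1, 1),
     (2, 'three', 0.2, -2, 1),
     (3, 'four', 0.3, -3, 0),
     (4, 'five', 0.4, -4, 1)],
    ['a', 'b', 'c', 'd', 'target'],
)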

I did the necessary encoding of string columns (in this case column 'b'):

from pyspark.ml import feature

cat_cols = ['b']
for cat_col in cat_cols:
    # map each string category to a numeric index (ordered by descending frequency)
    string_indexer = feature.StringIndexer(inputCol=cat_col, outputCol=cat_col+'_idx')
    model = string_indexer.fit(df)
    indexed = model.transform(df)

    # one-hot encode the index; dropLast=True by default
    ohe_encoder = feature.OneHotEncoder(inputCol=cat_col+'_idx', outputCol=cat_col+'_vec').fit(indexed)
    encoded = ohe_encoder.transform(indexed)
    df = encoded
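
For reference, the size of the encoded vector can be checked directly (a quick sketch, run after the loop above):

# inspect the one-hot vector produced for column 'b'
first_vec = df.select('b_vec').first()['b_vec']
print(first_vec.size)  # 'b' has 5 categories; with dropLast=True this is 4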

followed by encoding the target ('target' column):

label_indexer = feature.StringIndexer(inputCol='target', outputCol='label')
label_model = label_indexer.fit(df)
label_indexed = label_model.transform(df)
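
The fitted indexer's label ordering can be inspected as well (a small sketch; StringIndexer orders labels by descending frequency by default):

# the majority class '1' should therefore map to index 0.0
print(label_model.labels)  # expected for the sample data: ['1', '0']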

and finally assembled all the input features using VectorAssembler:

assembler = feature.VectorAssembler(
    inputCols=['a','b_vec', 'c', 'd'],
    outputCol='features'
)

output = assembler.transform(label_indexed)
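
The assembled 'features' column carries 'ml_attr' metadata that maps each slot of the vector back to a source column; dumping it shows how 'b_vec' expands into several binary slots (a minimal sketch):

# keys are attribute groups such as 'numeric' and 'binary';
# each entry pairs a vector slot index with a generated feature name
meta = output.schema['features'].metadata['ml_attr']['attrs']
for attr_group in meta.values():
    for attr in attr_group:
        print(attr['idx'], attr['name'])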

From here I trained an RF classifier model (from pyspark.ml):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol='label', featuresCol='features')

rf_fit = rf.fit(output)

Now, if I do rf_fit.featureImportances.toArray() I get:

array([0.29122807, 0.03912281, 0.19157895, 0.05263158, 0.04327485,
       0.25233918, 0.12982456])
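
Note that the seven values themselves do sum to 1:

# the raw importance vector is already normalized
print(rf_fit.featureImportances.toArray().sum())  # 1.0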

My point of confusion is two-fold:

  1. Firstly, I input 4 features but featureImportances gives me 7 values.
  2. Using this answer on the Databricks forum, I get the correct number of features, but the importances do NOT add up to 1 (my reading of that approach is sketched below).
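
For reference, this is roughly the metadata-based mapping I'm using, following that answer as best I understand it (the helper name is mine):

import pandas as pd

def map_importances(importances, dataset, features_col='features'):
    # pair each slot of the assembled feature vector with its importance
    attrs = dataset.schema[features_col].metadata['ml_attr']['attrs']
    rows = [attr for attr_group in attrs.values() for attr in attr_group]
    result = pd.DataFrame(rows)  # columns: 'idx', 'name'
    result['importance'] = result['idx'].apply(lambda i: float(importances[i]))
    return result.sort_values('importance', ascending=False)

print(map_importances(rf_fit.featureImportances, output))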

Can someone explain how featureImportances works in PySpark?

Note: this Stack Overflow answer gives the same solution as the Databricks one, and this answer creates feature names that are not among the actual columns.

Note: this uses pyspark.ml, NOT the mllib module.
