Pyspark: How to extract readable feature importance from spark-ml Random Forest

Asked Oct 18 '19 at 16:21

Active Oct 21 '19 at 16:18

Viewed 350 times

From the question pyspark-mllib-random-forest-feature-importances, I see there is an attribute called featureImportances that returns a SparseVector.

The output is something like this:

SparseVector(2, {0: 0.6, 1: 0.4})

My question is: how can I associate these importance values with the original column names? Is there a way to extract the column names from the RandomForestClassifier object?

EDIT: The model is the second stage of a pipeline. The first stage is a VectorAssembler object used to define the input columns for the model.
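For the pipeline layout described above, one possible approach is to read the input column names from the VectorAssembler stage and pair them with the values in featureImportances. The following is only a minimal sketch: the helper name readable_importances is hypothetical, and it assumes every assembler input is a plain scalar column, so that each column maps to exactly one slot of the feature vector. If an input column is itself a vector (for example, one-hot encoded), the lengths will differ and the helper raises an error instead of guessing.

```python
def readable_importances(pipeline_model):
    """Pair assembler input column names with random forest importances.

    Hypothetical helper. Assumes a fitted PipelineModel whose first stage
    is the VectorAssembler and whose last stage is the fitted random
    forest model, as described in the question.
    """
    assembler = pipeline_model.stages[0]
    rf_model = pipeline_model.stages[-1]

    names = list(assembler.getInputCols())
    # featureImportances is a vector; toArray() expands it to dense form.
    scores = [float(s) for s in rf_model.featureImportances.toArray()]

    # Assumption: one scalar input column per feature slot.
    if len(names) != len(scores):
        raise ValueError(
            "Got %d input columns but %d importances; some inputs are "
            "probably vector-valued." % (len(names), len(scores))
        )

    # Most important feature first.
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

Calling readable_importances(fitted_pipeline), where fitted_pipeline is the result of Pipeline.fit, would return a list of (column name, importance) tuples sorted from most to least important.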

edited Oct 21 '19 at 16:18

asked Oct 18 '19 at 16:21

paolof89


Random forest takes two input columns, a label column and a features column of type Vector, so what do you mean by "column names"? – chlebek Oct 18 '19 at 21:09
That's exactly the problem: I lose the feature names in the previous step of the pipeline, when I use the VectorAssembler object. I'll edit the question. – paolof89 Oct 21 '19 at 07:32

0 Answers