Very similar to this problem: "PySpark random forest feature importance mapping after column transformations", but with arrays instead of categorical values.
I'm running a feature-importance test on my final model. The features were array columns, for example device (with elements "mobile" or "desktop") and city (the cities visited, e.g. London, New York, ...), which I encoded using a CountVectorizer.
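For context, a CountVectorizer assigns each distinct array element a position in a vocabulary and emits a vector of counts at those positions. A minimal pure-Python sketch of the idea (hypothetical data, not Spark's actual implementation):

```python
# Sketch of what a CountVectorizer does to an array column.
# Hypothetical rows; Spark orders the vocabulary by term frequency.
from collections import Counter

rows = [["mobile"], ["desktop"], ["mobile", "desktop"], ["mobile"]]

# vocabulary ordered by descending document frequency
counts = Counter(term for row in rows for term in set(row))
vocabulary = [term for term, _ in counts.most_common()]

def vectorize(row):
    # position i of the output vector counts occurrences of vocabulary[i]
    c = Counter(row)
    return [c[term] for term in vocabulary]

print(vocabulary)             # ['mobile', 'desktop']
print(vectorize(["mobile"]))  # [1, 0]
```

So index 0 of the vector always refers to the same vocabulary term, which is what makes the mapping in the question recoverable.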
Afterwards, I ran a pipeline and constructed the featureImportances (similar to this: https://gist.github.com/colbyford/5443a525fe76b602f813ff7904c4dfff).
The end-result is the following:
idx   name             score
17    device_vector_0  0.483894
693   city_vector_69   0.001882
649   city_vector_25   0.001292
1172  city_vector_548  0.000000
1176  city_vector_552  0.000000
1177  city_vector_553  0.000000
My question is: how can I map each feature name (which is based on the vector index, I presume?) back to the original array value (device_vector_0 = mobile, city_vector_69 = London, ...)?
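One likely route (a sketch, not a confirmed answer): each fitted CountVectorizerModel exposes a `.vocabulary` list, and the numeric suffix in a name like `city_vector_69` is an index into that list. Assuming the vocabularies have been pulled from the fitted stages (e.g. `{m.getOutputCol(): m.vocabulary for m in vectorizer}`), the lookup itself is simple; the dictionaries below are hypothetical placeholders:

```python
import re

# hypothetical vocabularies, as recovered from each fitted
# CountVectorizerModel's .vocabulary attribute
vocabularies = {
    "device_vector": ["mobile", "desktop"],
    "city_vector": ["London", "New York"],
}

def feature_to_term(name):
    # split "city_vector_69" into column "city_vector" and index 69
    col, idx = re.match(r"(.+)_(\d+)$", name).groups()
    return vocabularies[col][int(idx)]

print(feature_to_term("device_vector_0"))  # mobile
print(feature_to_term("city_vector_1"))    # New York
```

Applying `feature_to_term` to the `name` column of the importance table above would then yield human-readable feature names.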
Cleaned code (df, assembler, rf, rf_model, resultDF and ExtractFeatureImportance are defined elsewhere):

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer

vector_list = list(set(['mobile', 'country']))
# fit one CountVectorizer per array column
vectorizer = [CountVectorizer(inputCol=column, outputCol=column + "_vector").fit(df)
              for column in vector_list]
pipeline_vector = Pipeline(stages=vectorizer)
pipeline = Pipeline().setStages([pipeline_vector, assembler, rf])
# rf_model is the fitted cross-validator; take its best pipeline
bestModel = rf_model.bestModel
ExtractFeatureImportance(bestModel.stages[-1].featureImportances, resultDF, "features")