Label vectorized-features in pipeline to original array name (PySpark)

Question

This is very similar to the question "Pyspark random forest feature importance mapping after column transformations", but with arrays instead of categorical values.

I'm running a feature importance test on my final model. The features were arrays, for example device (with elements "mobile" or "desktop") and city (whose elements are the cities visited, e.g. London, New York, ...), which I encoded using a CountVectorizer.

Afterwards, I ran a pipeline and constructed the featureImportances (similar to this: https://gist.github.com/colbyford/5443a525fe76b602f813ff7904c4dfff)

The end-result is the following:

idx   name             score
17    device_vector_0  0.483894
693   city_vector_69   0.001882
649   city_vector_25   0.001292
1172  city_vector_548  0.000000
1176  city_vector_552  0.000000
1177  city_vector_553  0.000000

My question is: how can I map the name (which is based on the array element, I presume) to the array value (device_vector_0 = mobile, city_vector_69 = London, ...)?

Cleaned Code:

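from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
# Array columns to encode with CountVectorizer
vector_list = list(set(['mobile', 'country']))
# Fit one CountVectorizer per column
vectorizer = [CountVectorizer(inputCol=column, outputCol=column + "_vector").fit(df) for column in vector_list]
pipeline_vector = Pipeline(stages=vectorizer)
# assembler, rf, rf_model and resultDF are defined elsewhere (not shown)
pipeline = Pipeline().setStages([pipeline_vector, assembler, rf])
bestModel = rf_model.bestModel
ExtractFeatureImportance(bestModel.stages[-1].featureImportances, resultDF, "features")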

Please spend a moment to see how to properly format your code blocks (done it for you this time); also, please include explicitly your **imports** (is it really Spark MLlib, as per your tag, or Spark ML?). — desertnaut, Jul 22 '19 at 15:24
A bit too fast, it's Spark ML (changed the tags). I think the issue is with the CountVectorizer loop, in the sense that I need to include something like get_feature_names somewhere in the loop (for multiple input/outputCols). — BartDP, Jul 23 '19 at 08:25
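
The idea in the last comment can be sketched roughly as follows. This is a minimal, plain-Python illustration and not a confirmed solution: it assumes that the numeric suffix in a name such as city_vector_69 is the index into the vocabulary of the CountVectorizer fitted on that column. In PySpark, each fitted CountVectorizerModel exposes its terms through the `vocabulary` attribute, so the dictionary below could presumably be built with something like `{m.getInputCol(): m.vocabulary for m in vectorizer}`. The vocabularies, the `label_feature` helper and the `NAME_PATTERN` regex here are hypothetical and only serve the example.

```python
import re

# Hypothetical vocabularies, standing in for the `vocabulary` lists of the
# fitted CountVectorizerModel objects (one list per input column).
vocabularies = {
    "device": ["mobile", "desktop"],
    "city": ["London", "New York", "Paris"],
}

# Matches names such as "city_vector_69" and captures ("city", "69").
NAME_PATTERN = re.compile(r"^(?P<column>.+)_vector_(?P<index>\d+)$")


def label_feature(name, vocabularies):
    """Return 'column=term' for a vectorized feature name.

    Names that do not match the pattern, or whose column or index is
    unknown, are returned unchanged.
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        return name
    column = match.group("column")
    index = int(match.group("index"))
    vocabulary = vocabularies.get(column)
    if vocabulary is None or index >= len(vocabulary):
        return name
    return f"{column}={vocabulary[index]}"


names = ["device_vector_0", "city_vector_1", "age"]
labels = [label_feature(n, vocabularies) for n in names]
print(labels)
```

Applied to the name column of the feature importance table, this would replace each vectorized name with its column and term, while leaving any non-vectorized feature names untouched.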
