
I am trying to plot the feature importances of a random forest classifier with the column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors.

I have included all the stages in a Pipeline.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import (RegexTokenizer, HashingTF, IDF,
                                StringIndexer, IndexToString)

regexTokenizer = RegexTokenizer(gaps=False,
                                inputCol=raw_data_col,
                                outputCol="words",
                                pattern="[a-zA-Z_]+",
                                toLowercase=True,
                                minTokenLength=minimum_token_size)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col)

indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name)
converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels)

feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])

estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)

pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])

model = pipeline.fit(df)

I generate the feature importances as follows:

rdc = model.stages[-2]
print (rdc.featureImportances)

So far so good, but when I try to map the feature importances to the feature columns using the example from this and this question, as below:

from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values())
)

[(name, rdc.featureImportances[idx])
 for idx, name in attrs
 if rdc.featureImportances[idx]]

I get a KeyError on ml_attr:

KeyError: 'ml_attr'

I printed the metadata dictionary,

print (df_pred.schema["featurescol"].metadata)

and it is empty: {}

Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?

Thanks

Praveen

1 Answer


I have not been able to resolve the empty metadata issue, but I can map the feature importances of the random forest classifier to the column names with the code below:

feature_importances = model.stages[-2].featureImportances
feature_imp_array = feature_importances.toArray()

# tf_model here is a fitted CountVectorizerModel; unlike HashingTF,
# CountVectorizer exposes a vocabulary whose order matches the feature indices.
feat_imp_list = []
for feature, importance in zip(tf_model.vocabulary, feature_imp_array):
    feat_imp_list.append((feature, importance))

feat_imp_list = sorted(feat_imp_list, key=(lambda x: x[1]), reverse=True)

top_features = feat_imp_list[0:50]
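
For completeness, here is a minimal sketch of how this could fit together end to end, assuming the HashingTF stage is replaced by a CountVectorizer so that a vocabulary is available. The stage names (count_vec), column names ("text", "label") and stage indices are illustrative, not taken from the original pipeline; df is the same input DataFrame as in the question.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF, StringIndexer

# Hypothetical pipeline: CountVectorizer instead of HashingTF, so feature
# indices can be traced back to tokens via the fitted model's .vocabulary.
regexTokenizer = RegexTokenizer(gaps=False, inputCol="text", outputCol="words",
                                pattern="[a-zA-Z_]+", toLowercase=True)
count_vec = CountVectorizer(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
rf = RandomForestClassifier(labelCol="label_idx", featuresCol="features", numTrees=100)

pipeline = Pipeline(stages=[regexTokenizer, count_vec, idf, indexer, rf])
model = pipeline.fit(df)

tf_model = model.stages[1]    # fitted CountVectorizerModel
rf_model = model.stages[-1]   # fitted RandomForestClassificationModel

feat_imp_list = sorted(
    zip(tf_model.vocabulary, rf_model.featureImportances.toArray()),
    key=lambda x: x[1], reverse=True)

# Print the top 10 features with their importances
for feature, importance in feat_imp_list[:10]:
    print(feature, importance)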
Praveen