I am trying to plot the ROC curve for a gradient boosting model. I came across the post "pyspark extract ROC curve?", but its approach doesn't seem to work for a GBTClassifier model.

I am using a dataset in Databricks; my code is below. It raises the following error:

AttributeError: 'PipelineModel' object has no attribute 'summary'

%fs ls databricks-datasets/adult/adult.data

from pyspark.sql.functions import *
from pyspark.ml.classification import  RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
import pandas as pd

dataset = spark.table("adult")
# splitting into train and test data frames
splits = dataset.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]

def predictions(train_df, target_col):
  """
  Fit the preprocessing + GBT pipeline and return the fitted PipelineModel.

  train_df   - training DataFrame
  target_col - target variable in the model
  """



  # one hot encoding and assembling
  encoding_var = [name for name, dtype in train_df.dtypes if dtype == 'string' and name != target_col]
  num_var = [name for name, dtype in train_df.dtypes if dtype in ('int', 'double') and name != target_col]

  string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
  onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
  label_indexes = StringIndexer(inputCol = target_col, outputCol = 'label', handleInvalid = 'keep')
  assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
  gbt = GBTClassifier(featuresCol = 'features', labelCol = 'label',
                     maxDepth = 5, 
                     maxBins  = 45,
                     maxIter  = 20)


  pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, gbt])

  model = pipe.fit(train_df)

  return model

gbt_model = predictions(train_df = train_df,
                     target_col = 'income')

import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(gbt_model.summary.roc.select('FPR').collect(),
         gbt_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
Gun

1 Answer

Based on your error, have a look at PipelineModel in this doc: https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.PipelineModel

An object of this class has no summary attribute. Instead, I believe you need to access the stages of the PipelineModel individually, such as gbt_model.stages[-1] (which should give access to your last stage, the fitted GBTClassifier). Then try the attributes available there, such as:

gbt_model.stages[-1].summary

And if your GBTClassifier has a summary, you'll find it there. Hope this helps.

Napoleon Borntoparty
  • This won't work because `GBTClassifier` doesn't have a `summary` attribute. Built-in ROC curves are only implemented for Random Forest and Logistic Regression. – numbermaniac Oct 04 '22 at 04:17
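
Following up on the comment above: if the GBT stage exposes no summary, one possible workaround is to compute the ROC points yourself from the model's predictions. The sketch below is only an illustration, not a built-in Spark API. `roc_points` is a hypothetical helper written in plain Python, and the commented usage assumes that `gbt_model.transform(test_df)` yields `probability` and `label` columns, that the positive class was indexed as 1.0, and that the test set is small enough to collect to the driver.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists for binary labels (1 = positive) and scores.

    Hypothetical helper: sweeps the decision threshold from high to low
    and records one ROC point each time the threshold value changes.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    fpr, tpr = [0.0], [0.0]  # the curve starts at the origin
    tp = fp = 0
    for i, (score, y) in enumerate(pairs):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties are grouped.
        if i == len(pairs) - 1 or pairs[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


# Assumed usage with the fitted pipeline from the question (not run here):
# rows = gbt_model.transform(test_df).select("probability", "label").collect()
# scores = [float(r["probability"][1]) for r in rows]  # P(label == 1), assumed
# labels = [int(r["label"]) for r in rows]
# fpr, tpr = roc_points(scores, labels)
# plt.plot(fpr, tpr)
```

The resulting `fpr` and `tpr` lists can replace the two `gbt_model.summary.roc` calls in the plotting code from the question. For a large test set, collecting every row may not be practical, so sampling or aggregating on the cluster first would be a reasonable adjustment.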