
I want to compute the area under the ROC curve for my machine learning model, but an AttributeError is raised. My complete code and the error details are included below. I already have a StringIndexer, a OneHotEncoder and a VectorAssembler in place. Please refer to the full code below:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()
    
df=spark.read.csv("2018-2010_import.csv",inferSchema=True,header=True)
    
train, test = df.randomSplit([0.7, 0.3], seed=7)
    
print(f"Train set length: {train.count()} records")
print(f"Test set length: {test.count()} records")

train.dtypes

catCols = [x for (x, dataType) in train.dtypes if dataType == "string"]
numCols = [
    x for (x, dataType) in train.dtypes if ((dataType == "double") & (x != "HSCode"))
]

print(numCols)
print(catCols)

train.agg(F.countDistinct("Commodity","Country")).show()

train.groupBy("Commodity","Country").count().show()

from pyspark.ml.feature import (
    OneHotEncoder,
    StringIndexer,
)

string_indexer = [
    StringIndexer(inputCol=x, outputCol=x + "_StringIndexer", handleInvalid="skip")
    for x in catCols
]

one_hot_encoder = [
    OneHotEncoder(
        inputCols=[f"{x}_StringIndexer" for x in catCols],
        outputCols=[f"{x}_OneHotEncoder" for x in catCols],
    )
]

from pyspark.ml.feature import VectorAssembler

assemblerInput = [x for x in numCols]
assemblerInput += [f"{x}_OneHotEncoder" for x in catCols]

vector_assembler = VectorAssembler(
    inputCols=assemblerInput, outputCol="VectorAssembler_features", handleInvalid="skip"
)

stages = []
stages += string_indexer
stages += one_hot_encoder
stages += [vector_assembler]

from pyspark.ml import Pipeline

pipeline = Pipeline().setStages(stages)
model = pipeline.fit(train)

pp_df = model.transform(test)

pp_df.select(
    "HSCode", "Commodity", "value", "Country", "VectorAssembler_features",
).show(truncate=False)
from pyspark.ml.classification import LogisticRegression

data = pp_df.select(
    F.col("VectorAssembler_features").alias("features"),
    F.col("HSCode").alias("label"),
)

model = LogisticRegression().fit(data)

model_summary.areaUnderROC

AttributeError                            Traceback (most recent call last)
C:\Users\AZMANM~1\AppData\Local\Temp/ipykernel_4856/3039136250.py in
----> 1 model_summary.areaUnderROC

AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'areaUnderROC'

model.summary.pr.show()

AttributeError                            Traceback (most recent call last)
C:\Users\AZMANM~1\AppData\Local\Temp/ipykernel_4856/3388404637.py in
----> 1 model.summary.pr.show()

AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'pr'

2 Answers

1

Your code doesn't show where the model_summary variable is defined.

Did you perhaps mean to use model.summary.areaUnderROC rather than model_summary.areaUnderROC?

The following example works for me:

from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Tiny dataset with a binary label (0.0 or 1.0) and per-row weights
    bdf = sc.parallelize(
        [
            Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)),
            Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)),
            Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)),
            Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0)),
        ]
    ).toDF()

    blor = LogisticRegression(weightCol="weight")
    blorModel = blor.fit(bdf)

    # Read the metric from the fitted model's training summary
    summary = blorModel.summary
    aur = summary.areaUnderROC
    print(aur)
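
One possible explanation for the error in the question: the message names LogisticRegressionTrainingSummary rather than BinaryLogisticRegressionTrainingSummary, which suggests the label column (HSCode) has more than two distinct values. As far as I know, areaUnderROC, roc and pr are only available on the binary summary. Below is a minimal sketch of a guard, assuming the model exposes numClasses, hasSummary and summary the way PySpark's LogisticRegressionModel does. area_under_roc_or_none is a hypothetical helper, and the SimpleNamespace objects are made-up stand-ins so the sketch runs without Spark:

```python
from types import SimpleNamespace


def area_under_roc_or_none(model):
    """Return the training areaUnderROC for a binary model, else None.

    Only the numClasses, hasSummary and summary attributes are read,
    so any object shaped like a fitted LogisticRegressionModel works.
    """
    if model.numClasses > 2 or not model.hasSummary:
        # Multiclass fit (or no training summary): no ROC-based metrics.
        return None
    return model.summary.areaUnderROC


# Hypothetical stand-ins for fitted models (assumption: not real Spark objects).
binary_model = SimpleNamespace(
    numClasses=2, hasSummary=True, summary=SimpleNamespace(areaUnderROC=0.75)
)
multiclass_model = SimpleNamespace(
    numClasses=40, hasSummary=True, summary=SimpleNamespace(accuracy=0.6)
)

print(area_under_roc_or_none(binary_model))      # 0.75
print(area_under_roc_or_none(multiclass_model))  # None
```

On the question's fitted model, printing model.numClasses should show how many distinct label values the model was trained on.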
vladsiv
  • I have used the 'model.summary.areaUnderROC' code but the error still persists. When I use your code, this error shows up, @Vlad Siv: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 51.0 failed 1 times, most recent failure: Lost task 0.0 in stage 51.0 (TID 288) (10.30.1.95 executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified – Azman Mahyuddin Oct 18 '21 at 09:12
  • @AzmanMahyuddin I'm not sure how you are running the code or what your setup is. Please try this: [-- java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified](https://stackoverflow.com/questions/68705417/pycharm-error-java-io-ioexception-cannot-run-program-python3-createprocess) – vladsiv Oct 18 '21 at 09:40
0

You will need to use BinaryClassificationEvaluator. After the train/test split, I named the training data train_set and the test data test_set. Here input_columns are all the columns other than the label column.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Combine the input columns into a single "features" vector column
assembler = VectorAssembler(inputCols=input_columns, outputCol="features")

Then call the vector assembler to transform your dataframe:

final_data = assembler.transform(your_dataframe)
print("Train test split...")
train_set, test_set = final_data.randomSplit([0.7, 0.3], seed=4000)

lr = LogisticRegression(
    labelCol="label", featuresCol="features", maxIter=10, threshold=0.5
)
lr_model = lr.fit(train_set)
predict_train = lr_model.transform(train_set)
predict_test = lr_model.transform(test_set)

# Evaluate on the held-out predictions; the label must be binary
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predict_test, {evaluator.metricName: "areaUnderROC"})))
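
For intuition about what areaUnderROC measures, and why it needs a binary label: it can be read as the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative one. Here is a minimal plain-Python sketch of that pairwise definition. area_under_roc is a hypothetical helper and the labels and scores are made up; Spark's evaluator computes the value from the ROC curve, so its result may differ slightly on real data:

```python
def area_under_roc(labels, scores):
    """Pairwise (rank-based) area under the ROC curve for 0/1 labels."""
    if set(labels) - {0, 1}:
        raise ValueError("labels must be 0 or 1 (binary classification only)")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative row")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(area_under_roc(labels, scores))  # 0.75
```

A label column with many distinct values, such as a product code, does not fit this definition directly, which is why the evaluator expects a 0/1 label.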
Nidhi
  • @Nidhi I get a new error: `IndentationError: unexpected indent`. When I remove the indent, another error pops up: NameError: name 'train_set' is not defined. I am sorry for asking such a basic question; I am new to machine learning. – Azman Mahyuddin Oct 18 '21 at 09:23
  • train_set is your training data, the data on which you train your model. Just like in scikit-learn, you split your data into train_set and test_set – Nidhi Oct 18 '21 at 10:25
  • The following error pops up: 'IllegalArgumentException: features does not exist. Available: HSCode, Commodity, value, Country, year'. The columns listed under 'Available' are the columns in my data. How do I resolve this? I have already fitted on my training and test sets in the code – Azman Mahyuddin Oct 20 '21 at 02:23
  • Do you copy, @Nidhi? – Azman Mahyuddin Oct 21 '21 at 02:09
  • Can you tell me where exactly you are getting the error? – Nidhi Oct 21 '21 at 09:34
  • I have edited the question to include the full code, @Nidhi. It already uses the string indexer, one hot encoder and vector assembler to transform the features. Have a look. When I tried to combine it with your code, another error popped up: 'NameError: name 'BinaryClassificationEvaluator' is not defined' – Azman Mahyuddin Oct 22 '21 at 08:06
  • Try importing it first. I have added the code. – Nidhi Oct 22 '21 at 08:56