I built a machine learning model to classify documents using NaiveBayesMultinomial. I am using the Weka Java API to train and test the model. To evaluate its performance I want to generate a ROC curve, but I do not understand how to calculate TPR and FPR for different threshold values. I have attached my source code and a sample dataset. I would be very grateful if anyone could help me calculate TPR and FPR for different threshold values so that I can generate the ROC curve. Thanks in advance for your help.

My Java code:

    package smote;
    import java.io.File;
    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;
    public class calRoc {
        public static void main(String[] args) throws Exception {
            String fileRootPath = "...../DocsFIle.arff";
            Instances rawData = DataSource.read(fileRootPath);

            // Configure the filter completely before calling setInputFormat;
            // options set after setInputFormat are not applied to the data.
            StringToWordVector filter = new StringToWordVector(10000);
            String[] options = { "-W", "10000", "-L", "-M", "2",
                    "-stemmer", "weka.core.stemmers.IteratedLovinsStemmer",
                    "-stopwords-handler", "weka.core.stopwords.Rainbow",
                    "-tokenizer", "weka.core.tokenizers.AlphabeticTokenizer" };
            filter.setOptions(options);
            filter.setIDFTransform(true);
            filter.setStopwords(new File("/Research/DoctoralReseacher/IEICE/Dataset/stopwords.txt"));
            filter.setInputFormat(rawData);
            Instances data = Filter.useFilter(rawData, filter);
            data.setClassIndex(0);

            int numRuns = 10;
            double[] recall = new double[numRuns];
            double[] precision = new double[numRuns];
            double[] fmeasure = new double[numRuns];
            double tp, fp, fn, tn;
            String[] classifierName = { "NBM" };
            double totalPrecision = 0, totalRecall = 0, totalFmeasure = 0;

            for (int run = 0; run < numRuns; run++) {
                Classifier classifier = new NaiveBayesMultinomial();
                int folds = 10;
                Random random = new Random(1);
                data.randomize(random);
                data.stratify(folds);
                tp = fp = fn = tn = 0;

                // 10-fold cross-validation: accumulate confusion counts over all folds.
                for (int i = 0; i < folds; i++) {
                    Instances trains = data.trainCV(folds, i, random);
                    Instances tests = data.testCV(folds, i);
                    classifier.buildClassifier(trains);
                    for (int j = 0; j < tests.numInstances(); j++) {
                        Instance instance = tests.instance(j);
                        double classValue = instance.classValue();
                        double result = classifier.classifyInstance(instance);
                        // The counts below treat class index 0 as the positive class.
                        if (result == 0.0 && classValue == 0.0) {
                            tp++;
                        } else if (result == 0.0 && classValue == 1.0) {
                            fp++;
                        } else if (result == 1.0 && classValue == 0.0) {
                            fn++;
                        } else if (result == 1.0 && classValue == 1.0) {
                            tn++;
                        }
                    }
                }

                // Note: these formulas use tn as the hit count, so they give
                // precision and recall for class index 1, not class index 0.
                if (tn + fn > 0)
                    precision[run] = tn / (tn + fn);
                if (tn + fp > 0)
                    recall[run] = tn / (tn + fp);
                if (precision[run] + recall[run] > 0)
                    fmeasure[run] = 2 * precision[run] * recall[run] / (precision[run] + recall[run]);

                System.out.println("The " + (run + 1) + "-th run");
                System.out.println("Precision: " + precision[run]);
                System.out.println("Recall: " + recall[run]);
                System.out.println("Fmeasure: " + fmeasure[run]);
                totalPrecision += precision[run];
                totalRecall += recall[run];
                totalFmeasure += fmeasure[run];
            }

            // Average the per-run metrics.
            double avgPrecision = totalPrecision / numRuns;
            double avgRecall = totalRecall / numRuns;
            double avgFmeasure = totalFmeasure / numRuns;
            System.out.println("avgPrecision: " + avgPrecision);
            System.out.println("avgRecall: " + avgRecall);
            System.out.println("avgFmeasure: " + avgFmeasure);
        }
    }

Sample dataset with a few instances:

    @relation 'CamelBug'

    @attribute Feature string
    @attribute class-att {0,1}

    @data
    'XQuery creates an empty out message that makes it impossible to chain more processors behind it ',1
    'org apache camel Message hasAttachments is buggy ',0
    'unmarshal new JaxbDataFormat com foo bar returning JAXBElement ',0
    'Can t get the soap header when the camel cxf endpoint working in the PAYLOAD data fromat ',0
    'camel jetty Exchange failures should not be returned as ',1
    'Delayer not working as expected ',1
    'ParallelProcessing and executor flags are ignored in Multicast processor ',1
Reja
  • You sound a little confused; ROC curves, by default, are calculated across all possible threshold values, and not for specific ones - see https://stackoverflow.com/questions/47104129/getting-a-low-roc-auc-score-but-a-high-accuracy/47111246#47111246 for some explanation – desertnaut Oct 26 '18 at 08:20
  • Thanks for your reply @desertnaut. Yes, I am a little confused about how exactly I can calculate TPR and FPR for different threshold values. I am not using the default evaluation functions to generate the ROC curve; I want to calculate TPR and FPR at each threshold point from TP, FP, FN and TN, in the same way that I calculate precision and recall in my Java code (see the source code). – Reja Oct 26 '18 at 22:26
  • ROC is pointless with classification methods, as the decision has already been made by the algorithm and there is no threshold to vary any longer. You need to get access to the class probabilities or some raw, quantitative score. I don't know how you'd do that with weka. Maybe try to find a regression algorithm instead. – Calimo Oct 27 '18 at 07:38
  • Thanks for your comment and suggestion. – Reja Nov 02 '18 at 11:31
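
To make the comments above concrete: a threshold can only be varied on a score, not on the hard label returned by `classifyInstance`. In Weka, `Classifier.distributionForInstance(instance)` returns the class probability distribution, which can serve as that score. The sketch below is plain Java and does not call Weka; `RocSketch`, `rocPoints` and `auc` are hypothetical names introduced here for illustration, and the scores in `main` are made up. It assumes `scores[i]` holds the predicted probability of the positive class and `positive[i]` holds the true label, and it counts TP, FP, FN and TN at each threshold in the same way the question's code counts them once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Weka: sweeps a decision threshold over scores. */
class RocSketch {

    /**
     * Returns one row {threshold, FPR, TPR} per candidate threshold.
     * scores[i] is the predicted probability of the positive class for instance i;
     * positive[i] says whether instance i truly belongs to the positive class.
     */
    static double[][] rocPoints(double[] scores, boolean[] positive) {
        // Candidate thresholds: every distinct score, highest first,
        // preceded by +infinity so the curve starts at (FPR, TPR) = (0, 0).
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        List<Double> thresholds = new ArrayList<>();
        thresholds.add(Double.POSITIVE_INFINITY);
        for (int i = sorted.length - 1; i >= 0; i--) {
            if (i == sorted.length - 1 || sorted[i] != sorted[i + 1]) {
                thresholds.add(sorted[i]);
            }
        }

        double[][] points = new double[thresholds.size()][];
        for (int t = 0; t < points.length; t++) {
            double threshold = thresholds.get(t);
            double tp = 0, fp = 0, fn = 0, tn = 0;
            for (int i = 0; i < scores.length; i++) {
                // An instance is predicted positive when its score reaches the threshold.
                boolean predictedPositive = scores[i] >= threshold;
                if (predictedPositive && positive[i]) tp++;
                else if (predictedPositive) fp++;
                else if (positive[i]) fn++;
                else tn++;
            }
            double tpr = (tp + fn) > 0 ? tp / (tp + fn) : 0.0; // TP / (TP + FN)
            double fpr = (fp + tn) > 0 ? fp / (fp + tn) : 0.0; // FP / (FP + TN)
            points[t] = new double[] { threshold, fpr, tpr };
        }
        return points;
    }

    /** Area under the curve by the trapezoidal rule over consecutive points. */
    static double auc(double[][] points) {
        double area = 0;
        for (int i = 1; i < points.length; i++) {
            double width = points[i][1] - points[i - 1][1];
            area += width * (points[i][2] + points[i - 1][2]) / 2.0;
        }
        return area;
    }

    public static void main(String[] args) {
        // Made-up scores and labels, for illustration only.
        double[] scores = { 0.9, 0.8, 0.3, 0.1 };
        boolean[] positive = { true, false, true, false };
        double[][] points = rocPoints(scores, positive);
        for (double[] p : points) {
            System.out.println("threshold=" + p[0] + " FPR=" + p[1] + " TPR=" + p[2]);
        }
        System.out.println("AUC=" + auc(points));
    }
}
```

In the cross-validation loop from the question, one way to fill `scores` would be to store `classifier.distributionForInstance(instance)[0]` for every test instance (taking class index 0 as positive, as the TP/FP counting there does), collect the values across all folds, and run the sweep once at the end. Plotting the FPR column against the TPR column then gives the ROC curve.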

0 Answers