4

I want to classify data into different classes based on its content. I did it using a naive Bayes classifier, and I get the best category to which each document belongs as output. But now I want to classify news that doesn't belong to any of the training categories into an "others" class. I can't manually collect training data for every possible other class, since there is a vast number of other categories. So is there any way to classify such data?

import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

public class ClassifyNews { // class name is arbitrary

    private static File TRAINING_DIR = new File("4news-train");
    private static File TESTING_DIR = new File("4news-test");
    private static String[] CATEGORIES = { "c1", "c2", "c3", "others" };

    private static int NGRAM_SIZE = 6;

    public static void main(String[] args) throws ClassNotFoundException, IOException {
        // one character n-gram language model per category
        DynamicLMClassifier<NGramProcessLM> classifier =
                DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
        for (int i = 0; i < CATEGORIES.length; ++i) {
            // each category has a sub-directory of TRAINING_DIR holding its training files
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) {
                String msg = "Could not find training directory=" + classDir;
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }

            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                // each document is handled as a training example for its known category
                Classification classification = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
                classifier.handle(classified);
            }
        }
    }
}
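
For reference, getting the best category for a test document looks roughly like this (a minimal sketch based on the training code above; the sub-directory and file name under TESTING_DIR are placeholders, and I believe `classify(...)` returns a LingPipe `JointClassification`):

// Sketch only: inside main, after the training loop above.
// Assumes an extra import: com.aliasi.classify.JointClassification
File testFile = new File(new File(TESTING_DIR, "c1"), "example.txt"); // placeholder path
String testText = Files.readFromFile(testFile, "ISO-8859-1");
JointClassification jc = classifier.classify(testText);
System.out.println("Best category: " + jc.bestCategory());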
lulu
  • Not sure what you are asking. Your training set is composed of the C1, C2, C3 categories only, and you want to classify into 4 categories: C1, C2, C3, others? – amit Feb 18 '14 at 10:01
  • I would strongly recommend getting a pencil and making sure you understand what calculations need to be done. The challenge you are facing has nothing to do with the code but with the calculations, so your question might be better suited for http://stats.stackexchange.com/ See these notes if you need any help with the calculations: http://www.inf.ed.ac.uk/teaching/courses/inf2b/lectureSchedule.html – matcheek Feb 18 '14 at 10:05
  • @matcheek I believe the question is in fact about the LingPipe library, not about naive bayes itself. – Jakub Kotowski Feb 18 '14 at 10:08
  • @matcheek this is not only about the LingPipe library but also about naive Bayes. I want to classify all data other than that belonging to c1, c2, c3 into the category "others". I am just asking how I can implement it. – lulu Feb 18 '14 at 10:18
  • I have built an intermediate model that avoids frequent re-training, and I pass the testing data to that model. This code is what I tried first: I train on the contents of different folders, e.g. in c1 I put data about c1 and train on it. Likewise I would have to train "others" too, so I would have to build training data for an "others" folder as well, which means collecting a huge amount of data unrelated to c1, c2 and c3. There should be some limit, right? – lulu Feb 18 '14 at 10:29

2 Answers

1

Naive Bayes gives you the "confidence" in each classification, as it computes

P(y|x) ~ P(y)P(x|y)

Up to the normalization by P(x), this is the probability of x belonging to class y. You can simply put a cut-off on this value and say that

cl(x) = "other" iff max_{over y}(P(y|x)) < T

where T can be, for example, the minimum confidence on the training set:

T = min_{over x and y in Training set}( P(y|x) )
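
In code, a rough sketch of this cut-off with the LingPipe classifier from the question could look like this (untested; `classify(...)` should return a `JointClassification`, whose `conditionalProbability(0)` is the normalized estimate for the top-ranked category, and the value of T below is only a placeholder):

// Sketch (untested): return "others" when the classifier's confidence in its
// best category falls below a threshold T. Goes inside the classifier class.
// Assumed imports: com.aliasi.classify.JointClassification, com.aliasi.classify.LMClassifier
static final double T = 0.5; // placeholder; derive it from the training set or cross-validation

static String classifyWithOthers(LMClassifier<?, ?> classifier, CharSequence text) {
    JointClassification jc = classifier.classify(text);
    double pBest = jc.conditionalProbability(0); // normalized P(best category | text)
    return pBest < T ? "others" : jc.bestCategory();
}

With a cut-off like this, "others" does not need its own training directory; only c1, c2 and c3 have to be trained.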
lejlot
  • I think, from his/her code, that the question is how to do that using the LingPipe library. – Jakub Kotowski Feb 18 '14 at 10:04
  • I don't think `T = min_{over x and y in Training set}( P(y|x) )` is a good idea; it is biased, since you are very confident on the results you trained on... You could do it while using cross-validation, though. – amit Feb 18 '14 at 10:04
  • @amit if you are training on labeled data, then it is reasonable to assume that these should not go to "others" in any scenario. Of course this is just the simplest idea/approach, which can be "smoothed" in dozens of ways. We could, for example, analyze the whole P(y|x) on the training data (with CV or not) and select a threshold that cuts off outliers, for example by a confidence ellipse, etc. – lejlot Feb 18 '14 at 10:08
  • @jkbkot - the problem stated here is not implementational, as it is not a "default" usage of NB. As a result, the core problem is to solve the actual task at the conceptual level; implementation is a whole different aspect. – lejlot Feb 18 '14 at 10:11
  • Yes, I agree that the question is stated sloppily :) If you look at the code and at the user's other questions, it looks more like a question about LingPipe. – Jakub Kotowski Feb 18 '14 at 10:13
  • @jkbkot..right.. I'm asking what you mentioned above – lulu Feb 18 '14 at 10:21
0

Just serialize the object... that means writing the trained classifier object to a file, and that file will be your model.

Then for testing you just need to pass the data to that model; there is no need to train each time. It will be much easier for you.
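
For the LingPipe classifier in the question, that could look roughly like this (untested sketch; `AbstractExternalizable.compileTo` / `readObject` from `com.aliasi.util` are, as far as I know, the usual way to write and load a compiled model, and the file name below is just a placeholder):

// Sketch (untested): compile the trained classifier to disk once, then load the
// compiled model for testing instead of re-training on every run.
// Assumed imports: com.aliasi.util.AbstractExternalizable, com.aliasi.classify.LMClassifier
File modelFile = new File("newsClassifier.model"); // placeholder file name

// after the training loop:
AbstractExternalizable.compileTo(classifier, modelFile);

// later, at testing time (readObject throws IOException/ClassNotFoundException):
LMClassifier<?, ?> compiledClassifier =
        (LMClassifier<?, ?>) AbstractExternalizable.readObject(modelFile);
System.out.println(compiledClassifier.classify(testText).bestCategory()); // testText: any test document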

chopss