I am trying to build a classifier with SVM light that assigns a document to one of two classes. I have already trained and tested the classifier, and the model file is saved to disk. Now I want to use this model file to classify completely new documents. What should the input file format be? Could it be a plain text file (I don't think that would work)? Could it be just a plain listing of the features present in the text, without any class label or feature weights (in which case I would have to keep track of the feature indices used in the feature vector during training)? Or is it some other format?

ritesh

2 Answers

Training and testing files must have the same format. Each instance is written as one line of the following form:

<line> .=. <target> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float> 
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

For example (copied from the SVM^light website):

-1 1:0.43 3:0.12 9284:0.2 # abcdef

You can consult the SVM^light website for more information.
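As a rough illustration of the grammar above, here is a minimal Python sketch that builds one such line from a target and a mapping of feature index to value. The function name `format_svmlight_line` is made up for this example and is not part of SVM^light. It sorts the pairs because SVM^light expects feature numbers in increasing order, and it skips zero values since the format is sparse.

```python
def format_svmlight_line(target, features, info=None):
    """Build one line of the form: <target> <feature>:<value> ... # <info>."""
    # Feature indices must appear in increasing order; zero values are omitted.
    pairs = " ".join(
        f"{index}:{value:.6g}"
        for index, value in sorted(features.items())
        if value != 0
    )
    line = f"{target} {pairs}"
    if info:
        line += f" # {info}"
    return line


# Reproduces the example line from the SVM^light website quoted above.
example = format_svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2}, info="abcdef")
print(example)
```

The `.6g` formatting is an arbitrary choice for readability; any float notation accepted by SVM^light would do.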

Marc Claesen
  • Marc, I am not trying to "test" the classifier here. I want to use it for the practical purpose of classifying completely unknown documents. In a "test" file I know the class to which each document belongs, so I can prepare the file accordingly. When I do a "real" classification, I know neither the class of the document nor the feature values (say I use tf-idf values in the training and testing phases; there is no idf value for a completely unknown document). So what would the file format be then? – ritesh Aug 20 '13 at 17:45
  • @ritesh Using a classifier is generally called the *testing phase*, even if you aren't interested in assessing its accuracy. You can either omit the first column (not sure if SVM^light allows this, I know libsvm does), or use a value of your choice there (definitely works). The labels are only used to report an accuracy. So if you don't have them, just use your favorite number, but be aware that any reported accuracy is completely bogus. – Marc Claesen Aug 20 '13 at 18:12
  • I must admit that I am really confused now. Let's say I put any number in the first column (instead of a class label). But then how do I calculate the feature values [the format is `<feature>:<value> ... <feature>:<value>`]? For training I am using tf-idf as well as class frequency to calculate this value, which takes into account the total number of training documents as well as the number of training documents in the class to which this document belongs. For testing, could this value be calculated differently from how it was calculated in training? If yes, what could this value possibly be? – ritesh Aug 21 '13 at 12:29
  • How did you make the training set? Make the test set in the same way ... I fail to see what confuses you. You *must* preprocess your test set in the *exact same way* as you did for the training set. Calculate tf based on the test documents and normalize based on the idf you used for the training set. – Marc Claesen Aug 21 '13 at 12:57
  • OK, that makes things clear. I could not see how I would get the idf value, but now I understand that it is the same one used for training. Thanks a lot, Marc, for your patience and time! – ritesh Aug 21 '13 at 13:55
  • @ritesh Could you give an example of a small text file and explain how to build the training file? I'm stuck at understanding how to translate my text document to this format. Your help would be very much appreciated. – rottweiler Jan 26 '15 at 08:31
  • He wants to make predictions, not actually test anything, so he doesn't have (didn't have) the class. – marbel Aug 25 '15 at 00:29
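To make the advice from the comments concrete (tf from the new document, idf from the training set), here is a small Python sketch. It assumes a plain tf-idf variant, raw term count times log(N/df), which may differ from the weighting the asker actually used, and it leaves out the class-frequency term mentioned in the comments. The helper names `fit_vocab_and_idf` and `vectorize` are hypothetical.

```python
import math
from collections import Counter


def fit_vocab_and_idf(train_docs):
    """Learn feature indices and idf weights from tokenized training documents."""
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    # SVM^light feature numbers start at 1.
    vocab = {term: i for i, term in enumerate(sorted(doc_freq), start=1)}
    n_docs = len(train_docs)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in doc_freq}
    return vocab, idf


def vectorize(doc, vocab, idf):
    """tf comes from the new document, idf from training; unseen terms are dropped."""
    tf = Counter(term for term in doc if term in vocab)
    return {vocab[term]: count * idf[term] for term, count in tf.items()}


train = [["cheap", "pills", "now"], ["meeting", "now"], ["cheap", "meeting", "room"]]
vocab, idf = fit_vocab_and_idf(train)

# "holiday" never occurred in training, so it has no index and is ignored.
new_vec = vectorize(["cheap", "cheap", "holiday"], vocab, idf)
print(new_vec)
```

The point of the sketch is that `vocab` and `idf` are computed once from the training data and then reused unchanged for every new document, so the feature indices and scaling match what the model saw during training.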

The file format for making predictions is the same as the one used for training and testing, i.e.

<line> .=. <target> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float> 
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

But when making predictions the target is unknown, so put 0 in the target position. This is the only difference. I hope this helps someone.
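A minimal Python sketch of that workflow, under a few assumptions: the file names are placeholders, the helper names `write_prediction_file` and `read_predictions` are invented for this example, and the output file is taken to contain one decision value per input line whose sign gives the predicted class. The call to `svm_classify` itself is only shown as a comment, and a fake output file stands in for its result.

```python
import os
import tempfile


def write_prediction_file(path, feature_dicts):
    """Write one line per unlabeled document, using 0 as a placeholder target."""
    with open(path, "w") as fh:
        for features in feature_dicts:
            pairs = " ".join(f"{i}:{v:.6g}" for i, v in sorted(features.items()))
            fh.write(f"0 {pairs}\n")


def read_predictions(path):
    """Map each decision value in the output file to +1 or -1 by its sign."""
    with open(path) as fh:
        # A value of exactly 0 is mapped to -1 here; that is an arbitrary choice.
        return [1 if float(line) > 0 else -1 for line in fh if line.strip()]


workdir = tempfile.mkdtemp()
new_docs_path = os.path.join(workdir, "new_docs.dat")
write_prediction_file(new_docs_path, [{1: 0.43, 3: 0.12}, {2: 0.8}])
with open(new_docs_path) as fh:
    written = fh.read()

# The classifier would then be run on this file, for example:
#   svm_classify new_docs.dat model_file predictions
# Below, a hand-written file stands in for the real output to show the parsing.
fake_output_path = os.path.join(workdir, "predictions")
with open(fake_output_path, "w") as fh:
    fh.write("1.2345\n-0.6789\n")
labels = read_predictions(fake_output_path)
print(written, labels)
```

Any accuracy that `svm_classify` reports for such a file is meaningless, since the 0 targets are placeholders rather than true labels.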

Nick