I am getting the following message in Weka Explorer when I try to use training data to classify new test data:

Problem evaluating classifier:
Train and test set are not compatible
Attributes differ at position 6:
Labels differ at position 1: TRUE != FALSE

I am using a J48 classifier to classify RSS feeds according to popularity of keywords both in boolean form and numerically. The problem occurs only with the boolean variant. My training data is like this:

@relation _dm_3793_855329_11032013_1362993476361_Boolean-weka.filters.unsupervised.attribute.NumericToNominal-R65

@attribute bin {FALSE,TRUE}
@attribute kill {FALSE,TRUE}
@attribute laden {FALSE,TRUE}
@attribute video {FALSE,TRUE}
@attribute pakistan {FALSE,TRUE}
@attribute imf {TRUE,FALSE}
…

Whereas the equivalent testing data is:

@relation _dm_4993_179211_18032013_1363611143017_Boolean-weka.filters.unsupervised.attribute.NumericToNominal-R65

@attribute bin {FALSE,TRUE}
@attribute kill {FALSE,TRUE}
@attribute laden {FALSE,TRUE}
@attribute video {FALSE,TRUE}
@attribute pakistan {FALSE,TRUE}
@attribute imf {FALSE,TRUE}
…

For the last line with attribute ‘imf’, the labels are reversed so I assume that this is the cause of the problem: but how can I solve it?

Both the training and the testing data are labelled, and a typical row resembles the following:

@data
FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE, …, ‘Name of class’

My .arff files are created dynamically in Java code as follows:

// Create .arff file.
CSVLoader loader = new CSVLoader();
loader.setSource(new File(cf.getCsvFile()));
Instances data = loader.getDataSet();
NumericToNominal numToNom = new NumericToNominal();
// Class attribute, if numeric, must be 'discretized'.
String[] options = Utils.splitOptions("-R " + columnNames.length);
numToNom.setOptions(options);
numToNom.setInputFormat(data);
// useFilter is a static method declared on weka.filters.Filter.
data = Filter.useFilter(data, numToNom);
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(cf.getArffFile()));
saver.writeBatch();

So can anyone tell me if I’m using a filter incorrectly or am missing something? My equivalent .arff files for numeric frequencies, generated through the same code, are compatible.

Thanks

Mr Morgan.

Mr Gwent

3 Answers

For the last line with attribute ‘imf’, the labels are reversed 
so I assume that this is the cause of the problem: 
but how can I solve it?

Change the test arff file so that it has the same header as the training file. Apart from the relation name, the training and testing files should carry identical header information. In this case, make the `imf` line of the test header read:

@attribute imf {TRUE,FALSE}

I had the same problem; see this question and answer. Basically, you decide on the header information once and put it in a separate file. Every data set then uses that same header. You can do this either in code or by creating the arff files by hand.
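One way to do this in code, as a rough sketch under stated assumptions: ARFF data rows refer to nominal values by label rather than by position, so the label order in an `@attribute` line can be rewritten without touching the `@data` section. The helper below is hypothetical (`ArffHeaderFix` and `normalizeHeader` are illustrative names, not part of Weka). It assumes the labels are spelled exactly TRUE and FALSE and that attribute names contain no braces. Every attribute whose labels are a subset of TRUE/FALSE is rewritten to `{FALSE,TRUE}`; all other lines are passed through unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ArffHeaderFix {
    private static final Set<String> BOOLEAN_LABELS =
            new HashSet<>(Arrays.asList("TRUE", "FALSE"));

    /** Returns a copy of the ARFF lines with boolean attribute labels in a fixed order. */
    public static List<String> normalizeHeader(List<String> arffLines) {
        List<String> out = new ArrayList<>();
        boolean inHeader = true;
        for (String line : arffLines) {
            String trimmed = line.trim();
            String lower = trimmed.toLowerCase(Locale.ROOT);
            if (lower.startsWith("@data")) {
                inHeader = false; // everything from @data onwards is left untouched
            }
            int open = trimmed.indexOf('{');
            int close = trimmed.lastIndexOf('}');
            if (inHeader && lower.startsWith("@attribute") && open >= 0 && close > open) {
                // Collect the declared labels, ignoring their order.
                Set<String> labels = new HashSet<>();
                for (String label : trimmed.substring(open + 1, close).split(",")) {
                    labels.add(label.trim());
                }
                // Only rewrite attributes that declare TRUE and/or FALSE and nothing else.
                if (BOOLEAN_LABELS.containsAll(labels)) {
                    out.add(trimmed.substring(0, open) + "{FALSE,TRUE}");
                    continue;
                }
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> demo = Arrays.asList(
                "@relation demo",
                "@attribute imf {TRUE,FALSE}",
                "@attribute video {TRUE}",
                "@attribute class {sport,politics}",
                "@data",
                "TRUE,TRUE,sport");
        normalizeHeader(demo).forEach(System.out::println);
    }
}
```

Running the same normalisation over both the training and the testing file should leave the two headers with identical label order, provided they already declare the same attributes in the same sequence.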

Atilla Ozgur
  • Thanks for the reply but `@attribute imf {TRUE,FALSE}` is simply a line where the problem occurs. There could be several such label reversals in a testing file compared to its training file. Each file may have up to 196 attributes. There is also the possibility that some lines may only have `@attribute XXX {TRUE}` or `@attribute XXX {FALSE}`. I would like all the attributes to include TRUE and FALSE if possible. I wonder would changing the values to 1 and 0 be a better approach? – Mr Gwent Apr 05 '13 at 19:27
  • Thanks. I will look at this tomorrow. Changing the values to 1 and 0 is possibly another approach. – Mr Gwent Apr 05 '13 at 19:40

I had an email exchange with Eibe Frank of Weka yesterday re this point. The conversation can be seen here:

https://list.scms.waikato.ac.nz/pipermail/wekalist/2013-April/057698.html

He made several suggestions, including using the same Weka NumericToNominal object for both the training and the testing datasets, or preserving that object by serialisation if the two datasets are generated at different times.

In my case, however, the best solution is to use integers 1 and 0 in place of TRUE and FALSE respectively, and regenerate the datasets affected.
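A minimal sketch of that substitution, assuming a simple comma-separated file with no quoted fields and a class label in the last column that must stay as it is. `BooleanToInt` and `convertRow` are hypothetical names used for illustration, not the code actually used to regenerate the datasets.

```java
import java.util.StringJoiner;

public class BooleanToInt {
    /** Replaces TRUE/FALSE with 1/0 in every column except the last (the class label). */
    public static String convertRow(String csvRow) {
        String[] cells = csvRow.split(",", -1); // -1 keeps trailing empty cells
        StringJoiner joined = new StringJoiner(",");
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            boolean isClassColumn = (i == cells.length - 1);
            if (!isClassColumn && cell.equals("TRUE")) {
                joined.add("1");
            } else if (!isClassColumn && cell.equals("FALSE")) {
                joined.add("0");
            } else {
                joined.add(cells[i]); // class label or any other value: unchanged
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(convertRow("FALSE,TRUE,FALSE,politics")); // prints 0,1,0,politics
    }
}
```

This would be applied to the data rows of the CSV before it is loaded, not to the row of column names.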

Thanks to Atilla Ozgur though.

Mr Gwent

If we substitute TRUE and FALSE with 1 and 0, we can no longer use some algorithms that require nominal values.

user27379