I am getting the following message in Weka Explorer when I try to use training data to classify new test data:
Problem evaluating classifier:
Train and test set are not compatible
Attributes differ at position 6:
Labels differ at position 1: TRUE != FALSE
I am using a J48 classifier to classify RSS feeds according to the popularity of keywords, represented both in boolean form and numerically. The problem occurs only with the boolean variant. My training data looks like this:
@relation _dm_3793_855329_11032013_1362993476361_Boolean-weka.filters.unsupervised.attribute.NumericToNominal-R65
@attribute bin {FALSE,TRUE}
@attribute kill {FALSE,TRUE}
@attribute laden {FALSE,TRUE}
@attribute video {FALSE,TRUE}
@attribute pakistan {FALSE,TRUE}
@attribute imf {TRUE,FALSE}
…
Whereas the equivalent testing data is:
@relation _dm_4993_179211_18032013_1363611143017_Boolean-weka.filters.unsupervised.attribute.NumericToNominal-R65
@attribute bin {FALSE,TRUE}
@attribute kill {FALSE,TRUE}
@attribute laden {FALSE,TRUE}
@attribute video {FALSE,TRUE}
@attribute pakistan {FALSE,TRUE}
@attribute imf {FALSE,TRUE}
…
For the last attribute shown, ‘imf’, the labels are reversed, so I assume that this is the cause of the problem. But how can I solve it?
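To make the mismatch concrete, here is a minimal sketch in plain Java (no Weka dependency) that compares the @attribute declarations of two headers and reports the first position where the label lists differ, treating order as significant. The class name HeaderDiff and its two methods are hypothetical helpers written only for illustration; they are not part of Weka, and the sketch assumes every nominal attribute is declared on a single line in the form shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Weka API.
public class HeaderDiff {

    // Extracts the label list from a nominal declaration such as
    // "@attribute imf {TRUE,FALSE}"; returns an empty list for other lines.
    static List<String> labels(String line) {
        int open = line.indexOf('{');
        int close = line.lastIndexOf('}');
        boolean isAttribute = line.trim().toLowerCase().startsWith("@attribute");
        List<String> out = new ArrayList<>();
        if (!isAttribute || open < 0 || close < open) {
            return out;
        }
        for (String label : line.substring(open + 1, close).split(",")) {
            out.add(label.trim());
        }
        return out;
    }

    // Returns the 1-based position of the first attribute whose label lists
    // differ (order matters), or -1 if all compared declarations agree.
    static int firstMismatch(List<String> train, List<String> test) {
        int n = Math.min(train.size(), test.size());
        for (int i = 0; i < n; i++) {
            if (!labels(train.get(i)).equals(labels(test.get(i)))) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList(
            "@attribute bin {FALSE,TRUE}",
            "@attribute kill {FALSE,TRUE}",
            "@attribute laden {FALSE,TRUE}",
            "@attribute video {FALSE,TRUE}",
            "@attribute pakistan {FALSE,TRUE}",
            "@attribute imf {TRUE,FALSE}");
        List<String> test = Arrays.asList(
            "@attribute bin {FALSE,TRUE}",
            "@attribute kill {FALSE,TRUE}",
            "@attribute laden {FALSE,TRUE}",
            "@attribute video {FALSE,TRUE}",
            "@attribute pakistan {FALSE,TRUE}",
            "@attribute imf {FALSE,TRUE}");
        System.out.println(firstMismatch(train, test)); // prints 6
    }
}
```

For the six declarations shown, the sketch reports position 6, which matches the position given in the error message.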
Both the training and testing data are labelled, and a typical row resembles the following:
@data
FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE, …, ‘Name of class’
My .arff files are created dynamically in Java code as follows:
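import java.io.File;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

// Load the CSV file.
CSVLoader loader = new CSVLoader();
loader.setSource(new File(cf.getCsvFile()));
Instances data = loader.getDataSet();
// Class attribute, if numeric, must be 'discretized'.
NumericToNominal numToNom = new NumericToNominal();
String[] options = Utils.splitOptions("-R " + columnNames.length);
numToNom.setOptions(options);
numToNom.setInputFormat(data);
data = Filter.useFilter(data, numToNom); // useFilter is a static method of Filter
// Create .arff file.
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(cf.getArffFile()));
saver.writeBatch();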
So can anyone tell me if I’m using a filter incorrectly or am missing something? My equivalent .arff files for numeric frequencies, generated through the same code, are compatible.
Thanks
Mr Morgan.