Weka StringToWordVector attributes omitted

Question

I'm working with Weka. My problem is that some of the attributes appear to be omitted after applying StringToWordVector. Here are my data and my code.

This is the ARFF file before using any filter:

@relation QueryResult

@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}

@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg

Here is my Java code:

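 import weka.core.Instances;
 import weka.filters.Filter;
 import weka.filters.unsupervised.attribute.NominalToString;
 import weka.filters.unsupervised.attribute.StringToWordVector;

 // Load the data (loadInstancesForWeka is a helper not shown here).
 // With two attributes, numAttributes() - 2 is index 0, i.e. the class column.
 Instances train = new Instances(loadInstancesForWeka("root","",sqlCommand));
 train.setClassIndex(train.numAttributes() - 2);
 System.out.println(train);

 // Convert the nominal text attribute (the last one, by default) to a string attribute
 NominalToString filter1 = new NominalToString();
 filter1.setInputFormat(train);
 train = Filter.useFilter(train, filter1);
 System.out.println("\nSelect after NominalToString \n"+train);

 // Turn the string attribute into one numeric attribute per word
 StringToWordVector filter = new StringToWordVector();
 filter.setInputFormat(train);
 train = Filter.useFilter(train, filter);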

After applying StringToWordVector, the output looks like this:

@relation 'QueryResult-weka.filters.unsupervised.attribute.NominalToString-Clast-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute class {Qualität,Bord,Kite,Harness}
@attribute edg numeric
@attribute evo numeric
@attribute foil numeric
@attribute end numeric
@attribute fin numeric

@data
{2 1}
{0 Bord,3 1}
{0 Kite,4 1}
{0 Harness,5 1}
{1 1}

So why are the attributes "foil,end,fin" omitted? Thank you for your help.

My code is based on this thread: https://stackoverflow.com/questions/41935193/simple-text-classification-using-naive-bayes-weka-in-java — Sarah Xoxo, Aug 31 '18 at 13:57

nekomatic · Accepted Answer · 2018-09-03T09:14:05.067

There aren't any attributes omitted from your output. The output is in sparse ARFF format:

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented. ...

Each instance is surrounded by curly braces, and the format for each entry is:
[index] [space] [value] where index is the attribute index (starting from 0).

So for the third instance in your example,

{0 Kite,4 1}

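means that attribute 0 for this instance is Kite, attribute 4 (i.e. 'end') is 1, and the other attributes are 0. For a nominal attribute, the "0" value is its first declared label, which is why the two instances of class Qualität ({2 1} and {1 1}) show no entry for attribute 0.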
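To make the decoding concrete, here is a minimal sketch in plain Java that expands the sparse lines from the question into dense rows. It is a hand-rolled illustration, not Weka's own parser: the class name SparseArffDemo and the method toDense are made up for this example, the attribute order is copied from the filtered header above, and quoted values containing commas or spaces are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Weka.
public class SparseArffDemo {

    // Attribute names in the order of the filtered header (index 0 is the class).
    public static final List<String> ATTRIBUTES =
            Arrays.asList("class", "edg", "evo", "foil", "end", "fin");

    // Labels of the nominal class attribute, in declaration order.
    // The Unicode escape for the umlaut keeps this source file ASCII-only.
    public static final List<String> CLASS_LABELS =
            Arrays.asList("Qualit\u00e4t", "Bord", "Kite", "Harness");

    /** Expands one sparse data line such as "{0 Kite,4 1}" into a dense row. */
    public static List<String> toDense(String sparseLine) {
        // Defaults for omitted entries: the first label for the nominal
        // class attribute, and 0 for every numeric word attribute.
        List<String> row = new ArrayList<>();
        row.add(CLASS_LABELS.get(0));
        for (int i = 1; i < ATTRIBUTES.size(); i++) {
            row.add("0");
        }

        // Strip the surrounding curly braces.
        String body = sparseLine.trim();
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return row;
        }

        // Each entry has the form "<index> <value>"; entries are comma-separated.
        for (String entry : body.split(",")) {
            String[] parts = entry.trim().split("\\s+", 2);
            row.set(Integer.parseInt(parts[0]), parts[1]);
        }
        return row;
    }

    public static void main(String[] args) {
        String[] lines = {"{2 1}", "{0 Bord,3 1}", "{0 Kite,4 1}", "{0 Harness,5 1}", "{1 1}"};
        for (String line : lines) {
            System.out.println(String.join(",", toDense(line)));
        }
    }
}
```

For the third line, {0 Kite,4 1}, this yields the dense row Kite,0,0,0,1,0, so all five word attributes are present and only the zeros were left out of the sparse form.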

It makes sense for StringToWordVector to produce sparse output because it creates a lot of new attributes, most of which will be 0 for each instance. If you need the non-sparse version you can use weka.filters.unsupervised.instance.SparseToNonSparse.

Thank you, that helped me a lot. I have another question: I want the data to be classified into categories (as above, "Qualität, ..."), but I also want a prediction of whether each item is positive, negative, or neutral. Do I have to run the classification again, or is there a special method for this? - Sarah Xoxo, Sep 05 '18 at 15:25
