I have a few ARFF files that I would like to read sequentially and combine into one large dataset. Instances.add(Instance inst) doesn't carry string values over to the target Instances, hence the attempt with setDataset(), but even that fails. Is there a way to do the intuitively correct thing for string attributes?

                ArffLoader arffLoader = new ArffLoader();
                arffLoader.setFile(new File(fName));
                Instances newData = arffLoader.getDataSet();
                for (int i = 0; i < newData.numInstances(); i++) {
                        Instance one = newData.instance(i);
                        one.setDataset(data);
                        data.add(one);
                }
fodon
  • How does it fail? What happens? If there's an error, post it. Also, setDataset() won't work unless the datasets are identical (e.g., if the features are in a different order, it will fail), and if the datasets _are_ identical, then there's no need to set the dataset. – kaz Jun 07 '12 at 03:09
  • The failure is that strings don't get updated. The data structures are identical. – fodon Jun 07 '12 at 03:43
  • I believe the strings are all set to 0. – fodon Jun 07 '12 at 03:49
  • What strings? `String` the Java data type? You're inserting an `Instance` into a set of `Instances`, so I'm having trouble understanding what you mean. What do you do to check whether the strings you're talking about get updated? That code would help us. – kaz Jun 07 '12 at 04:14
  • An Instance can have a string field. Typically, strings added to an Instance are stored in the associated Instances, and a reference to the distinct string is added to the Instance to save memory. – fodon Jun 07 '12 at 05:40
  • Do you mean a nominal attribute? That is the only reference to strings listed in the [documentation](http://bio.informatics.indiana.edu/ml_docs/weka/weka.core.Instance.html). And if so, do the two data sets have the same set of nominal values for that attribute, or different sets (e.g. `{"red","green","blue"}` in both, or `{"red","green"}` in one and `{"blue"}` in another)? – kaz Jun 07 '12 at 21:21

1 Answer

This is from the mailing list; I saved it a while ago.

How do I merge two data files, a.arff and b.arff, into one dataset?

It depends on which kind of merge you mean. Do you just want to append the second file (both have the same attributes), or do you want to merge the attributes (both have the same number of instances)?

In the first case ("append"): 
java weka.core.Instances append filename1 filename2 > output-file 

and the latter case ("merge"): 
java weka.core.Instances merge filename1 filename2 > output-file 

Here's the relevant Javadoc: http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main(java.lang.String[])

Use mergeInstances to merge two datasets.

 public static Instances mergeInstances(Instances first,
                                   Instances second)

For datasets with the same number of instances, your code would look something like this:

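import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// Load both files; fName1 and fName2 are the paths of the two ARFF files.
ArffLoader arffLoader = new ArffLoader();
arffLoader.setFile(new File(fName1));
Instances newData1 = arffLoader.getDataSet();
arffLoader.setFile(new File(fName2));
Instances newData2 = arffLoader.getDataSet();
// Joins the attributes side by side; both datasets must have the same number of instances.
Instances mergedData = Instances.mergeInstances(newData1, newData2);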

For datasets with the same attributes (the "append" case), I do not see a ready-made Java method in Weka. If you read the source of Instances.main, though, it does something like this:

// Instances.java
// public static void main(String[] args) {
// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
  DataSource source1 = new DataSource(args[1]);
  DataSource source2 = new DataSource(args[2]);
  String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
  if (msg != null)
    throw new Exception("The two datasets have different headers:\n" + msg);
  Instances structure = source1.getStructure();
  System.out.println(source1.getStructure());
  while (source1.hasMoreElements(structure))
    System.out.println(source1.nextElement(structure));
  structure = source2.getStructure();
  while (source2.hasMoreElements(structure))
    System.out.println(source2.nextElement(structure));
}
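
On the string problem from the question: the comments describe string values as stored in the owning Instances, with each Instance holding only a reference to them. If that is right, copying an instance's raw value into another dataset would resolve against the wrong string table. The sketch below is a toy model in plain Java, not Weka code. Every class and method name in it (`ToyDataset`, `intern`, `appendRawIndices`, `appendByValue`) is hypothetical and exists only to illustrate the idea of re-registering each string in the target dataset.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Weka code): a dataset with one string attribute. */
class StringAppendSketch {

    /** Hypothetical stand-in: distinct strings live in a table, rows hold indices. */
    static class ToyDataset {
        final List<String> stringTable = new ArrayList<>(); // distinct values
        final List<Integer> rows = new ArrayList<>();       // one table index per row

        static ToyDataset of(String... values) {
            ToyDataset d = new ToyDataset();
            for (String v : values) {
                d.addString(v);
            }
            return d;
        }

        /** Returns the table index of value, adding it to the table if absent. */
        int intern(String value) {
            int idx = stringTable.indexOf(value);
            if (idx < 0) {
                stringTable.add(value);
                idx = stringTable.size() - 1;
            }
            return idx;
        }

        void addString(String value) {
            rows.add(intern(value));
        }

        String stringAt(int row) {
            return stringTable.get(rows.get(row));
        }
    }

    /** Naive append: copies raw indices, which then point into the target's table. */
    static void appendRawIndices(ToyDataset target, ToyDataset source) {
        target.rows.addAll(source.rows);
    }

    /** Safe append: resolve each string in the source, then re-intern it in the target. */
    static void appendByValue(ToyDataset target, ToyDataset source) {
        for (int i = 0; i < source.rows.size(); i++) {
            target.addString(source.stringAt(i));
        }
    }

    public static void main(String[] args) {
        ToyDataset naive = ToyDataset.of("alpha", "beta");
        appendRawIndices(naive, ToyDataset.of("gamma"));
        // "gamma" had index 0 in its own table, so here it resolves to "alpha".
        System.out.println(naive.stringAt(2)); // prints "alpha"

        ToyDataset safe = ToyDataset.of("alpha", "beta");
        appendByValue(safe, ToyDataset.of("gamma"));
        System.out.println(safe.stringAt(2)); // prints "gamma"
    }
}
```

In Weka terms, the analogous step would presumably be to read each string from the source instance and register it with the target dataset's attribute before adding the instance. Check the Javadoc of your Weka version for the exact calls, for example `Attribute.addStringValue`.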
Atilla Ozgur