I am using Weka for my document classification research. I need to establish a baseline against which I can show that my contribution improves classification. However, running Latent Semantic Analysis with its default settings through the Weka API results in an OutOfMemoryError.
After some preprocessing, my training set consists of 25,765 attributes across 9,603 instances. The test set has the same class and normal attributes, but 3,299 instances.
I have 8 GB of RAM and already launch the JVM with the heap raised to 4 GB, roughly like this (the classpath is just a placeholder):
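java -Xmx4g -cp weka.jar:. test

Even with the larger heap I still get an OutOfMemoryError. Here is the stack trace: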
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at weka.core.matrix.Matrix.getArrayCopy(Matrix.java:301)
at weka.core.matrix.SingularValueDecomposition.<init>(SingularValueDecomposition.java:76)
at weka.core.matrix.Matrix.svd(Matrix.java:913)
at weka.attributeSelection.LatentSemanticAnalysis.buildAttributeConstructor(LatentSemanticAnalysis.java:511)
at weka.attributeSelection.LatentSemanticAnalysis.buildEvaluator(LatentSemanticAnalysis.java:416)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:596)
at weka.filters.supervised.attribute.AttributeSelection.batchFinished(AttributeSelection.java:455)
at weka.filters.Filter.useFilter(Filter.java:682)
at test.main(test.java:44)
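For completeness, the relevant part of my code (around test.java:44) looks roughly like this; the ARFF file name is a placeholder and I have trimmed everything not related to the filter:

import weka.attributeSelection.LatentSemanticAnalysis;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class test {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed training data (file name is a placeholder).
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // LSA evaluator with its default settings.
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();

        // Wrap it in the supervised AttributeSelection filter.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(lsa);
        filter.setSearch(new Ranker());
        filter.setInputFormat(train);

        // This is the call that throws the OutOfMemoryError (test.java:44).
        Instances reduced = Filter.useFilter(train, filter);
        System.out.println(reduced.numAttributes() + " attributes after LSA");
    }
}

As far as I can tell from the trace, the error occurs while SingularValueDecomposition copies the dense matrix: a single dense array of 9,603 x 25,765 doubles is already close to 2 GB, and since the SVD works on a copy (Matrix.getArrayCopy), at least two such arrays are live at once.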
I have tested my code with a smaller dataset and everything works as it should, so it is not a code-related problem. Could someone explain how I can scale LSA up to a dataset of this size? Or is there another, similar technique I could apply that scales better?