I use MALLET for topic modeling.
http://mallet.cs.umass.edu/topics.php
First, I try to import the training document set following the instruction.
bin/mallet import-dir --input /data/topic-input --output topic-input.mallet --keep-sequence --remove-stopwords
I always get OutOfMemoryError
, although I change "bin/mallet.bat"
according to the following page.
Mallet topic modelling
I set set MALLET_MEMORY=32G
.
My data set size is 30GB.
Computer memory is sufficient.
I get the following error.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3658)
at java.lang.String.<init>(String.java:201)
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:909)
at java.lang.StringBuffer.subSequence(StringBuffer.java:473)
at cc.mallet.extract.StringSpan.constructTokenText(StringSpan.java:49)
at cc.mallet.extract.StringSpan.<init>(StringSpan.java:33)
at cc.mallet.pipe.CharSequence2TokenSequence.pipe(CharSequence2TokenSequence.java:68)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:312)
$ bin/mallet import-dir --input ../Text --output topic-input.mallet --keep-sequence --remove-stopwords
Labels =
../Text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3658)
at java.lang.String.<init>(String.java:201)
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:909)
at java.lang.StringBuffer.subSequence(StringBuffer.java:473)
at cc.mallet.extract.StringSpan.constructTokenText(StringSpan.java:49)
at cc.mallet.extract.StringSpan.<init>(StringSpan.java:33)
at cc.mallet.pipe.CharSequence2TokenSequence.pipe(CharSequence2TokenSequence.java:68)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:312)
How can I fix this problem? Thank you.