I am trying to add a new language to Apache Tika's automatic language detection tool. Adding a language requires building a language profile, so I am using the Nutch language-identifier plug-in to build it.
The command is the following:
bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create ./language-detection-profile/jp ./language-detection-profile/japanese4ngram-1.txt utf-8
Here ./language-detection-profile/japanese4ngram-1.txt is the corpus for the new language.
I have tested this on a small corpus (1 MB) and everything works: the profile is created as I expected.
However, when the corpus is large (> 1 GB), the command runs out of memory (Java heap space):
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.nutch.analysis.lang.NGramProfile.create(NGramProfile.java:374)
    at org.apache.nutch.analysis.lang.NGramProfile.main(NGramProfile.java:484)
    ... 5 more
Does anyone know how to specify the Java heap size when running a Nutch plugin? Thanks.
Edit: With help from Mikaveli, this works on Ubuntu. I added -Xmx2048m to NUTCH_OPTS:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH -Xmx2048m"
fi
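One thing to watch with the snippet above: because -Xmx2048m sits inside the if, the heap setting is only applied when JAVA_LIBRARY_PATH is non-empty. Below is a minimal sketch of a variant that appends the heap flag unconditionally. The function name build_nutch_opts and the path /usr/lib/native are hypothetical and only there to make the behavior easy to try in isolation; they are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper, not part of bin/nutch: builds an options string
# like the guarded snippet does, but with the heap flag outside the guard.
build_nutch_opts() {
  opts="$1"       # existing NUTCH_OPTS value (may be empty)
  lib_path="$2"   # JAVA_LIBRARY_PATH value (may be empty)

  # Only add the library path option when a path is actually set.
  if [ "x$lib_path" != "x" ]; then
    opts="$opts -Djava.library.path=$lib_path"
  fi

  # Appended unconditionally, so the 2 GB heap limit applies either way.
  opts="$opts -Xmx2048m"
  printf '%s\n' "$opts"
}

# Without a library path, the heap flag is still present.
build_nutch_opts "" ""
# With a (made-up) library path, both options are present.
build_nutch_opts "" "/usr/lib/native"
```

If your copy of the launcher script appends to an existing NUTCH_OPTS from the environment (an assumption worth checking in your version), the same flag could probably be supplied by exporting NUTCH_OPTS before running the command instead of editing the script.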