I am trying to add a new language to Apache Tika's automatic language detection tool. Adding a language requires building a language profile, so I am using the Nutch language-identifier plugin to build it.

The command is the following:

bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create ./language-detection-profile/jp ./language-detection-profile/japanese4ngram-1.txt utf-8

Here ./language-detection-profile/japanese4ngram-1.txt is the corpus for the new language.

I tested this on a small corpus (1 MB) and everything works: the profile is created as expected.

However, when the corpus is large (> 1 GB), I run out of memory (Java heap space):

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.nutch.analysis.lang.NGramProfile.create(NGramProfile.java:374)
    at org.apache.nutch.analysis.lang.NGramProfile.main(NGramProfile.java:484)
    ... 5 more

Does anyone know how to specify the heap size for a Nutch plugin? Thanks.

Edit: Solved with help from Mikaveli. On Ubuntu, set the following in the nutch script:

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH -Xmx2048m"
fi
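
One caveat with the snippet above: -Xmx2048m sits inside the if, so it only takes effect when JAVA_LIBRARY_PATH is set. Below is a minimal sketch that appends the heap option unconditionally. It assumes the script passes NUTCH_OPTS to the JVM, as the snippet above implies, and the 2048m value is simply the one used there.

```shell
# Sketch only: assumes the launch script later passes $NUTCH_OPTS to java.
NUTCH_OPTS="${NUTCH_OPTS:-}"

# Add the native library path only when one is configured.
if [ -n "${JAVA_LIBRARY_PATH:-}" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

# Always raise the maximum heap, independent of JAVA_LIBRARY_PATH.
NUTCH_OPTS="$NUTCH_OPTS -Xmx2048m"

echo "NUTCH_OPTS:$NUTCH_OPTS"
```

Running this with JAVA_LIBRARY_PATH unset should still show -Xmx2048m in the printed options.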
user200340
  • http://stackoverflow.com/questions/6450132/java-seems-to-ignore-xms-and-xmx-options/6450260#6450260 Add the -Xmx option when running your JVM. – Michael Aug 18 '11 at 09:05
  • Hi Mikaveli, I ran "bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -Xmx2048m -create ./language-detection-profile/jp ./language-detection-profile/japanese4ngram-1.txt utf-8" and got the same error. I thought it should be specified in nutch-site.xml. – user200340 Aug 18 '11 at 09:38
  • Are you running the plugin in Eclipse? – Michael Aug 18 '11 at 09:42

1 Answer

Assuming you're developing on a Windows box, edit nutch.bat and add the following after the rem NUTCH_OPTS line:

set NUTCH_OPTS=%NUTCH_OPTS% -Xmx1024m

Set the heap size within the physical memory of your machine. Note that Nutch can easily require 4 GB, depending on what you're doing with it.
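
On Linux, one rough way to pick a value that stays within physical memory is to read the total from /proc/meminfo. This is only a sketch: the variable names are hypothetical, and taking half of the RAM is an arbitrary assumption, not a Nutch recommendation.

```shell
# Total physical memory in MB (MemTotal in /proc/meminfo is reported in kB).
total_mb=$(awk '/^MemTotal:/ {print int($2/1024)}' /proc/meminfo)

# Assumption: leave half of the RAM for the OS and other processes.
heap_mb=$(( total_mb / 2 ))

# Print the resulting JVM option, e.g. to paste into NUTCH_OPTS.
echo "-Xmx${heap_mb}m"
```

The printed option can then be used in place of the fixed -Xmx1024m value.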

Michael
  • Thanks. I am using Ubuntu, so I set: if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH -Xmx2048m"; fi – user200340 Aug 18 '11 at 10:11
  • Yep, that's it. Now anyone with the same issue on either platform has the answer. :) – Michael Aug 18 '11 at 10:15
  • The Nutch executable file has a JAVA_HEAP_MAX variable in it (at least in the latest version), so that is the only thing you need to change. – Jevgenij Evll May 20 '13 at 11:32