2

I have this function to read a .doc file using Tika on Linux:

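import os
import subprocess

def read_doc(doc_path):
    # Tika writes the extracted text next to the input file.
    output_path = doc_path + '.txt'
    java_path = '/home/jdk1.7.0_17/jre/bin/'
    # Copy the environment so the parent process settings are left untouched.
    environ = os.environ.copy()
    environ['JAVA_HOME'] = java_path
    environ['PATH'] = java_path
    tika_path = java_path + 'tika-app-1.3.jar'
    shell_command = 'java -jar %s --text --encoding=utf-8 "%s" >"%s"' % (tika_path, doc_path, output_path)
    proc = subprocess.Popen(shell_command, shell=True, env=environ, cwd=java_path)
    # Block until Tika has finished writing the output file.
    proc.wait()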

This function works fine when I run it from the command line, but when I call the same function using CGI, I get the following error:

Error occurred during initialization of VM
Could not reserve enough space for object heap

I checked previous answers for this particular error, and they suggest increasing the memory, but that doesn't seem to work. I don't think this has to do with memory allocation, but rather with read/write/execute privileges of the CGI script. Any idea how to solve this problem?

hmghaly
  • Would you not be better off running the Tika server, so there's only one startup cost, and having python pipe the data over to Tika for processing? – Gagravarr Apr 27 '13 at 18:49
  • Thanks, looks like a good idea. I managed to run the server but don't know how to actually use it... – hmghaly Apr 28 '13 at 12:38
  • It's documented, along with examples, on the [Tika Wiki](http://wiki.apache.org/tika/TikaJAXRS) – Gagravarr Apr 28 '13 at 18:04
  • For some reason I am unable to send a request to the server and get something back as shown in the examples; the command hangs and I need to press Ctrl+Z to break out. – hmghaly Apr 29 '13 at 10:50

2 Answers

3

You're loading an entire JVM instance within the memory and process space of each individual CGI invocation. That's bad. Very bad. For both performance and memory usage. Increasing the memory allocation is a hack that doesn't address the real problem. Core Java code should almost never be invoked via CGI.

You'd be better off:

  • Avoiding both CGI and Python by running a Java servlet within your web server that invokes the appropriate Tika class directly with the desired arguments. Map the user URL directly to the servlet (via the @WebServlet("someURL") annotation on the servlet class).
  • Running Tika in server mode and invoking it via REST from Python.
  • Running a core Java app separately as a server/daemon process that listens on a TCP ServerSocket, and invoking it from Python via a client socket.
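For the second option, the Python side can stay within the standard library. The following is only a minimal sketch: it assumes a Tika JAX-RS server is already running locally, that it listens on port 9998, and that it accepts a PUT of the raw document at /tika and answers with plain text. The names TIKA_URL, build_tika_request and read_doc_via_server are made up for this illustration.

```python
import urllib.request

# Assumed address of a locally running Tika JAX-RS server; adjust as needed.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(doc_bytes, url=TIKA_URL):
    """Build a PUT request that asks the server for plain-text output."""
    return urllib.request.Request(
        url,
        data=doc_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def read_doc_via_server(doc_path, url=TIKA_URL, timeout=60):
    """Send the file to the Tika server and return the extracted text."""
    with open(doc_path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With this approach the CGI script never starts a JVM itself; it only makes an HTTP call to a process that is already running.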
Glen Best
  • Looks good, I think running tika in server mode is best, and I was able to start it, but I am unable to use netcat to do the following command (I do not have root access to install netcat): nc 127.0.0.1 12345 < MyFileToExtract – hmghaly May 03 '13 at 15:03
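The netcat command in that comment can be approximated with Python's standard socket module, so nothing needs to be installed. This is a rough sketch, assuming the Tika server mode listens on 127.0.0.1:12345 as in the command above and replies with the extracted text once the client stops sending; send_file_to_tika is a hypothetical helper name.

```python
import socket


def send_file_to_tika(doc_path, host="127.0.0.1", port=12345, timeout=60):
    """Stream a file to a listening socket and return everything sent back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with open(doc_path, "rb") as f:
            conn.sendall(f.read())
        # Half-close the connection so the server sees end-of-input
        # while we can still read its reply.
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Something like `text = send_file_to_tika('MyFileToExtract').decode('utf-8')` would then play the role of the nc call. The half-close is the important part: without it, a server that waits for end-of-input before answering would leave the client hanging.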
-1

Try adding an explicit maximum heap size to the shell command, for example -Xmx256m. (-Xmx and -XX:MaxHeapSize set the same limit, so only one of them is needed.) The shell command then looks like this:

shell_command = 'java -Xmx256m -jar %s --text --encoding=utf-8 "%s" >"%s"' % (tika_path, doc_path, output_path)
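Whether a heap setting helps depends on how much memory the CGI process is actually allowed to reserve. One possible explanation for the difference between the shell and CGI is that the web server runs scripts under a lower address-space limit; that is an assumption, not something the question confirms. A small sketch to check it from inside the CGI script, using the Unix-only resource module (describe_limit and address_space_limit are illustrative names):

```python
import resource


def describe_limit(value):
    """Render an rlimit value, which is RLIM_INFINITY when unlimited."""
    return "unlimited" if value == resource.RLIM_INFINITY else "%d bytes" % value


def address_space_limit():
    """Return the (soft, hard) address-space limits of this process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    # In a CGI script, emit the header first so the output reaches the browser.
    print("Content-Type: text/plain\n")
    print("RLIMIT_AS soft=%s hard=%s" % address_space_limit())
```

Running the same script from the shell and through CGI and comparing the two lines would show whether the environments differ in this respect.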

Jordan Jambazov