Fatal error using python-javabridge JVM in Celery thread with NLTK on Mac OS X

Question

I am using the Python wrapper for Weka, which is based on python-javabridge. I have a long-running task to perform, so I am running it with Celery. The problem is that I get

A fatal error has been detected by the Java Runtime Environment:

  SIGSEGV (0xb) at pc=0x00007fff91a3c16f, pid=11698, tid=3587

JRE version:  (8.0_31-b13) (build )
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode bsd-amd64 compressed oops)
Problematic frame:
C  [libdispatch.dylib+0x616f]  _dispatch_async_f_slow+0x18b

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

If you would like to submit a bug report, please visit:
    http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

when starting the JVM inside the thread. The JVM is started with these two lines (from weka.core.jvm):

javabridge.start_vm(run_headless=True)
javabridge.attach()

From what I've read, it is probably caused by the fact that the JVM is not attached to the Celery thread. However, javabridge.attach() is indeed run inside it.
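For reference, the attach/detach pairing for a thread that did not start the JVM can be sketched as a small context manager. This is only an illustration: the name attached_to_jvm is hypothetical, and the attach and detach callables are passed in so the sketch runs without a JVM. With javabridge they would presumably be javabridge.attach and javabridge.detach.

```python
from contextlib import contextmanager


@contextmanager
def attached_to_jvm(attach, detach):
    """Run a block between attach() and detach() calls.

    attach/detach are injected callables (hypothetical helper); with
    javabridge one would pass javabridge.attach and javabridge.detach.
    """
    attach()
    try:
        yield
    finally:
        # Always detach, even if the block raised an exception.
        detach()
```

Usage would look like "with attached_to_jvm(javabridge.attach, javabridge.detach): ...", placed around the Java calls made from the secondary thread.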

What am I missing?

EDIT: I identified the code that is causing the trouble. It has to do with an NLTK tokenizer. The following code (adapted from Vebjorn's answer) reproduces the error:

# hello.py
from nltk.tokenize import RegexpTokenizer
import javabridge
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False    

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True

    sentence = "This is a sentence with some numbers like 1, 2 or and some weird symbols like @, $ or ! :)"
    tokenizer = RegexpTokenizer(r'\w+')
    tokenized_sentence = tokenizer.tokenize(sentence.lower())
    print "Tokens:", tokenized_sentence

    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                             dict(greetee='world'))

Without starting the JVM, the code runs properly. It also works when not running as a Celery task. I don't understand why it crashes.

EDIT 2: It actually works in a clean Ubuntu environment (Dockerized) but not on Mac OS X Yosemite (v10.10).

EDIT 3: As mentioned in the comments, it works if the import (from nltk.tokenize import RegexpTokenizer) is done inside the task wrapper, that is, inside the hello() function.
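A minimal sketch of that workaround, shown as a plain function so it can be read without the Celery setup. The name tokenize_in_task is hypothetical; in hello.py the same lines would sit inside hello(). Why deferring the import avoids the crash is not established in this thread.

```python
def tokenize_in_task(sentence):
    """Tokenize a sentence, importing NLTK only when first called."""
    # Deferred import: nothing from NLTK is loaded when this file is
    # imported, only when the function first runs (for a Celery task,
    # that is inside the worker process).
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(sentence.lower())
```

Defining the function does not import NLTK; the import statement executes on the first call.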

Your modified example works for me, even on my Mac, if I move the NLTK import inside the task (i.e., inside the `hello` function). — Vebjorn Ljosa, Apr 28 '15 at 08:52
Yes, it does work. Thanks for your first answer; it really helped me figure out how this works and what the actual problem was (or at least its origin). However, I still don't understand why it fails when the import is done outside the task wrapper, and I would be really interested in the answer... someday. — Victor, Apr 28 '15 at 14:35

score 2 · Accepted Answer · answered Apr 21 '15 at 10:14

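By default, Celery starts one worker process per CPU core, which is four on my machine (see "concurrency: 4 (prefork)" in the worker banner). The -c command line option to celery worker changes this number. You need to ensure that you start the JVM in every one of these processes. This example works for me: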

# hello.py
from celery import Celery
import javabridge

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True
    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                                 dict(greetee='world'))

and

# client.py
from hello import hello

r = hello.delay()
print r.get(timeout=1)

Install on a fresh Ubuntu 14.04 machine:

$ sudo apt-get update -y
$ sudo apt-get install -y openjdk-7-jdk python-pip python-numpy python-dev rabbitmq-server
$ sudo pip install celery javabridge
$ sudo /etc/init.d/rabbitmq-server start

Start worker:

$ celery -A hello worker
...
 -------------- celery@a7cc1bedc40d v3.1.17 (Cipater)
---- **** ----- 
--- * ***  * -- Linux-3.16.7-tinycore64-x86_64-with-Ubuntu-14.04-trusty
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         hello:0x7f5464766b50
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     amqp
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ---- 
--- ***** ----- [queues]
 -------------- .> celery           exchange=celery(direct) key=celery


[2015-04-21 10:04:31,262: WARNING/MainProcess] celery@a7cc1bedc40d ready.

In another window, run the client five times:

 $ python client.py 
 Hello, world!
 $ python client.py 
 Hello, world!
 $ python client.py 
 Hello, world!
 $ python client.py 
 Hello, world!
 $ python client.py 
 Hello, world!

Observe in the worker window that the JVM is started on the first four calls from the client (which go to four different processes) but not on the fifth:

[2015-04-21 10:05:53,491: WARNING/Worker-1] Starting the VM
[2015-04-21 10:05:55,028: WARNING/Worker-2] Starting the VM
[2015-04-21 10:05:56,411: WARNING/Worker-3] Starting the VM
[2015-04-21 10:05:57,318: WARNING/Worker-4] Starting the VM
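The "start once per worker process" logic from hello.py can also be sketched as a small helper. This is an illustration under assumptions, not part of the tested example: ensure_started and _started_pid are hypothetical names, and the start function is injected so the sketch runs without a JVM. The flag is keyed by process ID rather than stored as a plain boolean, on the assumption that a boolean set before a fork would be copied into the child process even though no JVM was started there.

```python
import os

# PID of the process in which start_fn was last called (None = never).
_started_pid = None


def ensure_started(start_fn):
    """Call start_fn() at most once per process.

    Returns True if start_fn was called by this invocation, else False.
    """
    global _started_pid
    pid = os.getpid()
    if _started_pid == pid:
        return False
    start_fn()
    _started_pid = pid
    return True
```

Inside the task, this would be called as ensure_started(lambda: javabridge.start_vm(run_headless=True)) before the first Java call. Celery also provides a worker_process_init signal, dispatched in each pool process when it starts, which may be a suitable place for the same call.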

Thanks, Vebjorn. I could make your example work on my computer, but it doesn't work when I adapt my problematic code to it. Probably related to Weka then. — Victor, Apr 21 '15 at 13:15
