
I am trying to optimize the performance of some natural language processing in a Python project I am currently working on. Basically, I would like to outsource the computationally intensive parts to Apache OpenNLP, which is written in Java.

My question is: what would be the recommended way to call Java functions/classes from my Python code? The three main ways I have thought of are

  • using C/C++ bindings in Python and then embedding a JVM in my C program. This is what I am leaning towards because I am somewhat familiar with writing C extensions to Python, but using a triangle of languages where C only functions as an intermediary doesn't seem right somehow.

  • using Jython. My main concern with this is that CPython is by far the most popular Python implementation, as far as I know, and I don't want to break compatibility with other collaborators or packages.

  • streaming input and output to the binaries that come with OpenNLP. Apache provides tokenizers and the like as stand-alone binaries that you can pipe data to and from. This would probably be the easiest option to implement, but it also seems like the crudest.
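
The third option can be sketched with the standard library's `subprocess` module. Note that the exact `opennlp` command name and the model filename below are assumptions based on a typical OpenNLP distribution, not something verified against any particular install:

```python
import subprocess

def pipe_through(cmd, text):
    """Send `text` to an external command's stdin and return its stdout."""
    proc = subprocess.Popen(cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(text.encode("utf-8"))
    return out.decode("utf-8")

# Hypothetical invocation of the OpenNLP tokenizer CLI; adjust the
# binary path and model file to your installation.
# tokens = pipe_through(["opennlp", "TokenizerME", "en-token.bin"],
#                       "Pierre Vinken, 61 years old.")
```

One caveat with this approach: if you spawn a fresh process per chunk of text, you pay JVM startup and model-loading cost on every call, so a long-lived process that you keep feeding via the pipe is much preferable.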

I'm wondering if anyone with experience interfacing Python and Java knows how much the performance is likely to differ between these options, and which one is "recommended" or considered best practice in such a situation. Or, of course, whether there is an entirely different way to do it that I haven't thought of.

I did search SO for existing answers and found this, but it's an answer from 3.5 years ago and mentions some projects that are either dead, hard to integrate/configure/install or still under development.

Some comments mentioned that the overhead for all three methods is likely to be insignificant compared to the time required to run the actual NLP code. This is probably true, but I'm still interested in what the answer is from a more general perspective.

Thanks!

Moritz
  • Uhm, I don't see that performance of bindings really matters here... Most time will be spent in OpenNLP itself, right? – fge Feb 25 '14 at 07:18
  • Hmm, I don't know - you tell me :) one thing to consider is that the way the code currently is, I call OpenNLP frequently to process small chunks of text at a time, so that's why I thought optimizing the bindings might be important (although in theory it would of course be possible to restructure my existing code to minimize the number of calls.) I was also just curious what the "best practice" is for this type of scenario – Moritz Feb 25 '14 at 07:32
  • Well, _you_ tell me in this case ;) If chunks are small and processed quickly, then yes, the overhead of the "cross-language call" is indeed a thing to be considered. – fge Feb 25 '14 at 07:33
  • Calling the behemoth Java to optimize performance sounds a bit contradictory to me. But let us know your results. – Hyperboreus Feb 25 '14 at 07:42

1 Answer


Consider building a Java server with an existing language-independent RPC mechanism (Thrift, ...). Then use Python as the RPC client to talk to the server. This gives you loose coupling.
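
To illustrate the pattern, here is a minimal sketch using Python's built-in XML-RPC instead of Thrift, so the example is self-contained. In practice the server process would be Java, exposing an equivalent `tokenize` method (e.g. via Apache XML-RPC) and holding the loaded OpenNLP models, so the model-loading cost is paid once; the method name and the tokenizer stand-in below are illustrative assumptions:

```python
from threading import Thread
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def tokenize(text):
    # Stand-in for the real OpenNLP tokenizer running on the Java side.
    return text.split()

# Demo server on an OS-chosen free port; in reality this is the Java process.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(tokenize, "tokenize")
port = server.server_address[1]
Thread(target=server.serve_forever, daemon=True).start()

# The client side is what the Python project would actually contain.
client = ServerProxy("http://127.0.0.1:%d" % port)
tokens = client.tokenize("Pierre Vinken , 61 years old")
```

Because the server stays up between requests, per-call overhead is just the RPC round trip, and either side can be swapped out independently.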

yinqiwen