
I asked this same question on the mongodb-user list: http://groups.google.com/group/mongodb-user/browse_thread/thread/b3470d6a867cd24

I was hoping someone on this forum might have some insight...

I ran a simple experiment comparing cursor-iteration performance in Python vs. Java and found that the Python implementation is about 10x slower. I was hoping someone could tell me whether this difference is expected or whether I'm doing something clearly inefficient on the Python side.

The benchmark is simple: it performs a query, iterates over the cursor, and inspects the same field in each document. In the Python version, I can inspect about 22k documents per second. In the Java version, I can inspect about 220k documents per second.
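
For readers who want to reproduce the measurement, here is a minimal sketch of the kind of timing loop described above. It is not the code from the Google Groups thread; the function name `docs_per_second`, the field name `"a"`, and the generator standing in for a cursor are all assumptions made for illustration.

```python
import time

def docs_per_second(cursor, field):
    """Iterate over `cursor`, read `field` from each document,
    and return (document count, documents per second)."""
    count = 0
    start = time.perf_counter()
    for doc in cursor:
        doc.get(field)  # inspect the same field in every document
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate

# Hypothetical stand-in for a driver cursor: any iterable of dicts works,
# so the sketch runs without a MongoDB server.
fake_cursor = ({"date": i, "a": "x", "b": "y"} for i in range(100000))
count, rate = docs_per_second(fake_cursor, "a")
print(f"{count} documents at {rate:,.0f} docs/sec")
```

With a real database, the same function could be handed the cursor returned by pymongo's `collection.find(query)` in place of the generator.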

I've seen a few similar questions about Python performance, and I've followed the advice there and made sure I'm using the C extensions:

>>> import pymongo 
>>> pymongo.has_c() 
True 
>>> import bson 
>>> bson.has_c() 
True 

Finally, I don't believe the discrepancy is due to fundamental differences between Python and Java, at least at the level of my test code. For example, if I store the queried documents in a Python list, I can iterate over that list very quickly. In other words, it's not an inefficient Python for-loop that accounts for the difference. Furthermore, I get almost identical performance in Java and Python when inserting documents.
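
The list experiment mentioned above can be sketched roughly as follows: drain the cursor into a list first, then time the plain loop separately, so that the time spent fetching and decoding is not mixed with the loop overhead. The name `split_timing` and the sample documents are assumptions for illustration, not the original test code.

```python
import time

def split_timing(cursor, field):
    """Return (count, seconds spent draining the cursor,
    seconds spent looping over the resulting list)."""
    start = time.perf_counter()
    docs = list(cursor)  # with a real driver, fetching and BSON decoding happen here
    fetch_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for doc in docs:
        doc.get(field)  # the same per-document work as in the benchmark
    loop_seconds = time.perf_counter() - start
    return len(docs), fetch_seconds, loop_seconds

# Hypothetical stand-in for a cursor so the sketch runs without a server.
sample = ({"date": i, "a": "x", "b": "y"} for i in range(50000))
n, fetch_seconds, loop_seconds = split_timing(sample, "a")
print(f"{n} docs: drain {fetch_seconds:.4f}s, loop {loop_seconds:.4f}s")
```

If the loop time is small compared with the drain time, the cost sits in fetching and decoding rather than in the for-loop itself. Note that materializing the full result set needs enough memory to hold every document, so a limited query may be more practical.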

Here are a few more details about the query:

  • Both the Python and Java implementations use the same query on the same collection and run on the same machine.
  • The collection contains about 20 million documents.
  • The query returns about 2 million documents, i.e., I'm retrieving about 10% of the collection.
  • Each document contains three simple fields: a date and two strings.
  • The query is indexed and the time spent in the actual query is negligible for both the Python and Java implementations. It's the cursor iteration that accounts for the runtime.
Sam
  • 7
    The Java driver may read the entire result set into memory, while the Python driver streams the results. You could try setting the batch_size in both drivers. – Joshua Martell Mar 31 '12 at 02:51
  • 4
    Can you post the full code to both the Python and Java versions of code? We can see if others can duplicate your results. – Chris W. Jun 06 '12 at 06:14
  • 1
    Just to note, if you follow the link to the Google Groups thread, the code is posted (both languages) as well as reviewed by 10gen people and further tests performed. TL;DR 10gen testing showed a little over 2x slower with python than java and some of the differential in testing may come from the version of python used – Adam Comerford Jul 18 '12 at 23:02
  • Could you post these details: your test server (including CPU, RAM, etc.), and also re-run the tests showing more timing info than just before and after `foo()` is called? I'm thinking, "what is the average time it takes for Java to perform one iteration of its `while` loop? For Python to perform one iteration of `for doc in curs`?" – cegfault Jul 19 '12 at 17:48
  • It would be _very_ helpful if you ran a profiler on your python code and posted the results. It looks like a tight loop, which CPython is usually pretty bad at. If you need reference for that, please consult http://docs.python.org/library/profile.html – t.dubrownik Aug 05 '12 at 19:16
  • Please provide a test case that I can execute myself, and I will look at it. – h4ck3rm1k3 Aug 17 '12 at 18:58
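
As a rough illustration of the profiling suggestion in the comments above, the sketch below runs an iteration loop under the standard-library `cProfile` module and captures the report as text. The helper names `iterate` and `profile_iteration` and the sample documents are assumptions; with a real driver, the cursor from the benchmark would be passed in instead.

```python
import cProfile
import io
import pstats

def iterate(cursor, field):
    """The loop under test: touch one field per document."""
    count = 0
    for doc in cursor:
        doc.get(field)
        count += 1
    return count

def profile_iteration(cursor, field, top=10):
    """Profile `iterate` and return (its result, the report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(iterate, cursor, field)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # top entries by cumulative time
    return result, buffer.getvalue()

# Hypothetical stand-in for a cursor so the sketch runs without a server.
sample = ({"date": i, "a": "x", "b": "y"} for i in range(10000))
count, report = profile_iteration(sample, "a")
print(report)
```

The report lists where the time goes per function, which should help show whether the cost sits in the driver's fetch and decode path or in the loop body.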

1 Answer


Well, having looked at your post on Google Groups as well, here's my 2c:

  1. Python is slower than Java. Since Python is dynamically typed, its interpreter cannot do all the Java JIT "magic", so it will generally be slower at runtime.

  2. On the Google Groups thread it is stated that:

"The big surprise in the results is how the Python benchmark performance degrades when I insert shorter values. If anything, I would have expected the opposite. Comparatively, the Java numbers are essentially the same for long vs. short strings".

This can be misleading due to Mongo's asynchronous behaviour when it comes to writes. Make sure you set the same Write Concern when you fire those writes in both your Java and Python benchmarks (and preferably set it to SAFE_MODE). In other words, if you don't specifically set any Write Concern, make sure the driver's default value is the same in both Python and Java variants.

Shivan Dragon