For a graph algorithm I need to fetch a lot of records from a database into memory (~1M records). I want this to be fast and I want the records to be objects (that is: I want ORM). To crudely benchmark different solutions I created a simple problem of one table with 1M Foo records, like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?

One can see that fetching them using bare SQL is extremely fast; converting the records to objects using a simple for-loop is also fast. Both execute in around 2-3 seconds. However, using ORMs like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
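
For reference, the bare-SQL baseline I am comparing against looks roughly like this (a sketch; the connection URL, credentials and table name are placeholders for my setup, and Foo is the class shown below):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class RawJdbcBenchmark {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust for your environment.
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "password");
        long time = System.currentTimeMillis();
        List<Foo> foos = new ArrayList<Foo>(1000000);
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT a, b, c FROM foo");
        while (rs.next()) {
            // One plain constructor call per row: this is the cheap
            // "simple for-loop" conversion the timings refer to.
            foos.add(new Foo(rs.getInt(1), rs.getInt(2), rs.getInt(3)));
        }
        rs.close();
        st.close();
        con.close();
        System.out.println("Raw JDBC fetch took: " + (System.currentTimeMillis() - time) + " ms");
    }
}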

SQLAlchemy advertises a "Mature, High Performing Architecture" (http://www.sqlalchemy.org/features.html), and Hibernate similarly claims "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow very generic object-oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than plain SQL and native code. Personally I think they could publish better benchmarks to show this, that is, benchmarks comparing them against native SQL + Java or Python. But that is not the problem at hand.

Of course, I don't want SQL + native code, as it is hard to maintain. So I was wondering why something like an object-oriented database, which handles the database-to-object mapping natively, does not exist. Someone suggested OrientDB, hence I tried it. The API is quite nice: once you have your getters and setters right, the object is insertable and selectable.

But I want more than just API-sweetness, so I tried the 1M example:

import java.io.Serializable;

public class Foo implements Serializable {
    public Foo() {}
    public Foo(int a, int b, int c) { this.a=a; this.b=b; this.c=c; }
    public int a,b,c;

    public int getA() { return a; }
    public void setA(int a) { this.a=a; }
    public int getB() { return b; }
    public void setB(int b) { this.b=b; }
    public int getC() { return c; }
    public void setC(int c) { this.c=c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;

public class Main {
    public static void insert() throws Exception {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);

        int N=1000000;
        long time = System.currentTimeMillis();
        for(int i=0; i<N; i++) {
            Foo foo = new Foo(i, i*i, i+i*i);
            db.save(foo);
        }
        db.close();
        System.out.println(System.currentTimeMillis() - time);
    }

    public static void fetch() {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        long time = System.currentTimeMillis();

        for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
            if(f.getA() == 345234) System.out.println(f.getB());
        }
        System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
        db.close();
    }

    public static void main(String[] args) throws Exception {
        //insert();
        fetch();
    }
}

Fetching 1M Foos using OrientDB takes approximately 18 seconds. The for-loop with getA() is there to force the object fields to actually be loaded into memory, as I noticed that by default they are fetched lazily. I guess this may also be why fetching the Foos is slow: there is a database access on each iteration, instead of one database access that fetches everything (including the fields) up front.
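
One workaround I am aware of for the lazy proxies is detaching each object into a plain POJO via detachAll, although detaching 1M objects is itself not free (a sketch):

// Sketch: force all fields to be materialized by detaching the proxy
// into a plain, non-proxied Foo instance. This trades the lazy per-field
// loads for an explicit per-object detach cost.
for (Foo proxy : db.browseClass(Foo.class)) {
    Foo plain = db.detachAll(proxy, true); // true = return a non-proxied instance
    if (plain.getA() == 345234) System.out.println(plain.getB());
}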

I tried to fix that using setFetchPlan("*:-1"), which I figured might also apply to fields, but that did not seem to work.
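
For completeness, the other place I know a fetch plan can be attached is an explicit SQL query instead of the class iterator; a sketch of that variant (OSQLSynchQuery is the synchronous query class of the 1.7 API):

import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;
import java.util.List;

// Sketch: attach the fetch plan to a synchronous SQL query rather than
// to the browseClass iterator.
OSQLSynchQuery<Foo> query = new OSQLSynchQuery<Foo>("select from Foo");
query.setFetchPlan("*:-1");
List<Foo> result = db.query(query);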

Question: Is there a way to do this fast, preferably in the 2-3 seconds range? Why does this take 18 seconds, whilst the bare SQL version uses 3 seconds?

Addition: Using an ODatabaseDocumentTx like @frens-jan-rumph suggested gave me a speedup not of approximately 5x but of approximately 2x. Adjusting the code as follows gave me a running time of approximately 9 seconds. This is still 3 times slower than raw SQL, even though no conversion to Foos was executed. Almost all time goes into the for-loop.

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.iterator.ORecordIteratorClass;
import com.orientechnologies.orient.core.record.impl.ODocument;

public static void fetch() {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
    long time = System.currentTimeMillis();
    ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
    it.setFetchPlan("*:0");
    System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    time = System.currentTimeMillis();
    for (ODocument f : it) {
        //if((int)f.field("a") == 345234) System.out.println(f.field("b"));
    }
    System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    db.close();
}

– Herbert

3 Answers


The answer lies in convenience.

During an interview, when I asked a candidate what they thought of LINQ (C#, I know, but pertinent to your question), they quite rightly answered that it trades performance for convenience.

A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query into nice, easy-to-use POCOs.

That said, the difference should not be as great as what you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that large. I do have experience here; within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.

With large swathes of data, I would expect an initial slow-down from an ORM, but after that it should be negligible. Going from 3 seconds to 18 seconds is huge.
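
To make the cost concrete in Java (to match the question, though my experience is with C#): here is, purely illustratively, a hand-written row mapper next to the kind of generic reflection-based mapping an ORM does under the hood. Real ORMs cache field lookups and generate bytecode, but they also track identity and dirty state, which adds overhead back:

import java.lang.reflect.Field;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RowMappers {
    // Hand-written mapping: one constructor call, three getInt calls.
    static Foo mapByHand(ResultSet rs) throws SQLException {
        return new Foo(rs.getInt("a"), rs.getInt("b"), rs.getInt("c"));
    }

    // Generic reflection-based mapping: roughly what the auto-magical
    // path costs per row before any caching or bytecode generation.
    static <T> T mapByReflection(ResultSet rs, Class<T> type, String... cols) throws Exception {
        T obj = type.newInstance();
        for (String col : cols) {
            Field f = type.getDeclaredField(col);
            f.setAccessible(true);
            f.set(obj, rs.getObject(col));
        }
        return obj;
    }
}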

– Moo-Juice

  • I see your point and I agree, but this does not tell me how to get a "negligible slow-down" instead of a slow-down of factor 6 :) I.e. how can my code run faster? ;) – Herbert Jul 17 '14 at 19:51
  • @Herbert, to answer *that*, we'd need to see the code that your chosen ORM is running to convert the raw rows to our friendly POCOs! – Moo-Juice Jul 17 '14 at 19:55
  • I am not using an ORM like that. OrientDB is a graph / document database engine (you can choose at db creation). It has links (like pointers) to objects, instead of join-like structures with IDs and indexes. Conversion to POJOs (it's Java, not C#) is, to the best of my knowledge, trivial and hence fast by design. I was hoping for a speedup from someone who knows OrientDB, or a suggestion for another fast (3-seconds fast) DB + ORM architecture. – Herbert Jul 17 '14 at 21:19

If you profile your test, you will discover that around 60 - 80% of the CPU time is taken by execution of the following four methods:

  • com.orientechnologies...OObjectEntitySerializer.getField(...)
  • com.orientechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
  • com.orientechnologies...OObjectMethodFilter.isScalaClass(...)
  • javassist...SecurityActions.getDeclaredMethods(...)

So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x, which might just get you where you want to be.

Still, a lot of time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That is expensive for just reading an integer from a memory location. I don't understand why plain Java NIO ByteBuffers aren't used here; that would save a lot of crossing the Java / native border, etc.
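
For comparison, this is the kind of plain NIO access I mean; a direct ByteBuffer lives off-heap too, but getInt is an intrinsified JVM call with no JNA border crossing:

import java.nio.ByteBuffer;

// Sketch: off-heap storage via a direct ByteBuffer. Reads and writes
// stay inside the JVM; no native-library call per integer.
ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
buffer.putInt(0, 42);         // write an int at offset 0
int value = buffer.getInt(0); // read it back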

Apart from these micro benchmarks and remarkable behaviour in OrientDB I think that there are at least two other things to consider:

  • Does this test reflect your expected workload?
    • I.e. you read a straightforward list of records. If so, why use a database? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
  • Does this test reflect your expected setup?
    • E.g. you are reading from a plocal database, while reading from a database over TCP/IP might just as well have its bottleneck somewhere else. Also, you are reading from one thread / process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more bookkeeping overhead, etc.).

P.S. I would recommend warming up the code before benchmarking.
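
A minimal sketch of what I mean, assuming the fetch() from the question: run the measured code a few times and discard those timings, so JIT compilation and class loading do not end up in the measured run:

// Sketch: discard warm-up iterations before trusting the numbers.
for (int i = 0; i < 3; i++) {
    fetch(); // warm-up runs; ignore their timings
}
fetch();     // measured run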

– Frens Jan

What you do here is a worst-case scenario for an ORM. Written by hand for your database, your test just reads a table and writes it directly to a stream of whatever.

So what you see is the complete overhead of a lot of magic. Usually, if you do something more complex like joining, selecting, filtering and ordering, the overhead of your ORM comes down to a more reasonable 5 to 10%.

Another thing you should think about (and I guess Orient does the same): the ORM solution creates new objects, multiplying memory consumption, and Java is really bad at memory consumption. This is why I use custom in-memory tables whenever I handle a lot of data / objects: there, an object is just a row in a table.

Another thing: your objects also get inserted into a list / map (at least Hibernate does this) so that their dirtiness can be tracked once you change them. This insertion also takes a lot of time as the data set scales, and it is a reason why we use paginated lists or maps: copying 1M references around is dead slow once the backing structure grows.
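
For Hibernate specifically, the usual way around the dirty-tracking cost on bulk reads is a StatelessSession, which keeps no first-level cache and does no dirty checking (a sketch, assuming a configured SessionFactory named sessionFactory):

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.StatelessSession;

// Sketch: a StatelessSession never copies your 1M references into a
// tracking map, so iteration stays much closer to raw JDBC speed.
StatelessSession session = sessionFactory.openStatelessSession();
ScrollableResults results = session.createQuery("from Foo")
        .scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    Foo foo = (Foo) results.get(0);
    // process foo; it is a plain, detached object
}
results.close();
session.close();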

– Martin Kersten