Recursive Datalog queries for Datomic really slow

Question

I'm currently evaluating Datomic for the use-case of storing and querying parsed symbols that form an ontology. In total there are 225122 symbols (entities) in the database (so it's a rather big ontology, but shouldn't be a big deal for a DB).

The structure is pretty standard, symbols have

parent symbols that contain them (like in sub-symbol etc)
supersymbols (symbols they inherit from)

To have nice access to the symbols, we have a unique name for each symbol. This adds up to the following Datomic schema:

[{:db/ident :ml/name,
  :db/valueType :db.type/string,
  :db/cardinality :db.cardinality/one,
  :db/unique :db.unique/identity}
 {:db/ident :ml/parent,
  :db/valueType :db.type/ref,
  :db/index true,
  :db/cardinality :db.cardinality/one}
 {:db/ident :ml/superclass,
  :db/valueType :db.type/ref,
  :db/index true,
  :db/cardinality :db.cardinality/one}]

Now I have the most basic recursive query "give me all symbols that are (transitively) contained in symbol p". In Datomic terms:

(def rules
  '[
    [(ubersymbol ?c ?p) (?c :ml/parent ?p)]
    [(ubersymbol ?c ?p) (?c :ml/parent ?c1) (ubersymbol ?c1 ?p) ]
    ])
(q '[:find ?c ?n :in $ % :where
     (ubersymbol ?c ?d) [?d :ml/name "name of a root symbol"] [?c :ml/name ?n]]
   current-db rules)

The query itself (so a mid-sized symbol) takes between 5 and 5.5 seconds and returns 80 hits. Not milliseconds, but real seconds. And this is only the most basic query I want to ask about the dataset (it's intended to be used from a web-tool to help modellers understand the structure of the ontology).

I'm running datomic-pro-0.9.5554, with a memory database and use the peer library (I started the server as described in the "getting started" guide.

Help is really appreciated to make a case for Datomic.

markus

Can you try reversing the order of the body of your second rule? — Valentin Waeselynck, Feb 26 '17 at 08:22
it is still 5 secs (I followed the recommendation to use the most selective clause first) — fricke, Feb 26 '17 at 11:40

Valentin Waeselynck · Accepted Answer · 2017-02-27T12:59:35.093

7

EDIT

As fricke discovered by himself, it was a problem of clause ordering, but in the query, not in the ruleset. A more efficient version would be:

[:find ?c ?n :in $ % :where
   [?d :ml/name "name of a root symbol"]
   (ubersymbol ?c ?d) 
   [?c :ml/name ?n]]

The above query can be further improved by:

using query parameters instead of using a dynamic parameter in the query body
using a lookup ref to resolve the input entity by its :ml/name

which yields:

(d/q
  '[:find ?c ?n :in % $ ?d :where
    (ubersymbol ?c ?d)
    [?c :ml/name ?n]]
  rules current-db [:ml/name "name of a root symbol"])

My theory is that your rules are not written in a way that Datalog can optimize for this read pattern - probably resulting in a traversal of all the entities. I suggest to rewrite them as follows:

[[(ubersymbol ?c ?p) 
  (?c :ml/parent ?p)]
 [(ubersymbol ?c ?p) 
  ;; we bind a child of the ancestor, instead of a parent of the descendant
  (?c1 :ml/parent ?p)
  (ubersymbol ?c ?c1)]]

This way of writing the ruleset is optimized to find the descendants of some node. The way you originally wrote it is optimized to find the anscestors of some node.

A quick benchmark on my machine with Datomic 0.9.5385 on a balanced binary tree of 50000 entities showed that you do get the desired performance with the second approach.

edited Feb 27 '17 at 12:59

answered Feb 26 '17 at 14:58

Valentin Waeselynck

5,950
26
43

Strangely, with this change it now takes 10secs. What is even stranger, is that if I query with a `:ml/name` that is not known it also takes 10secs to return an empty result (though in my opinion it should be much faster) – fricke Feb 26 '17 at 20:54
strange indeed... not sure I can provide anymore help without more knowledge of the data distribution. – Valentin Waeselynck Feb 26 '17 at 20:58
I figured it out now, I have to swap the `:ml/name` to be the first clause, so the query has to be `(q '[:find ?c ?n :in $ % :where [?d :ml/name "12880"] (uberscope ?c ?d) [?c :ml/name ?n]] (db @conn) rules)`, then it only takes 40msec. Thanks a lot – fricke Feb 26 '17 at 21:00
1

btw: with the new query, it also works with the first rule-set. I guess that when the `(uberscope ?c ?p)` was first, the query-engine didn't reorder the clauses to get the most restrictive one at the beginning. – fricke Feb 26 '17 at 21:04
Datomic does not attempt to optimize queries. It is up to you to put the most restrictive clauses first in order to speed processing. – Alan Thompson Feb 27 '17 at 05:50
@AlanThompson arguably, the Datalog engine does a bit of optimization (e. g choosing the indexes to navigate and range predicates), just not very much of it :) – Valentin Waeselynck Feb 27 '17 at 13:24
True, it does a bit of reordering - I experimented with more complex queries (including different transitive closures) and query-times are very sensitive to the order of the clauses. It tell us how much SQL-engine are actually doing in being so robust. But you should always keep in mind how queries are evaluated in datalog - and re-reading the documentation I found the recommendation of doing so. – fricke Feb 27 '17 at 14:21

Recursive Datalog queries for Datomic really slow

1 Answers1

EDIT