Efficient Gremlin queries for Variable-length paths

Question

I'm trying to benchmark the performance of Gremlin for biology-related knowledge graphs.

I need to write a Gremlin query that is equivalent to this Neo4j/Cypher:

MATCH path = (gene:Gene) - [:enc] -> (prot:Protein)
  - [:h_s_s|ortho|xref*0..2] - (prot1:Protein)
  - [:is_a|ac_by] - (enz:Enzyme)
  - [:ac_by|in_by] -> (cmp:Comp)
  - [:cs_by|pd_by] -> (trn:Transport) 
  - [:part_of*0..3] -> (pwy:Path)

RETURN 
  [ n in nodes(path) | n.iri ] as nodeIris, 
  rand() AS rnd
ORDER BY rnd
LIMIT 100

That is, proteins might be linked to other proteins by chains of up to 2 relations such as xref (the real chains can be much longer, but I'm setting a limit), and pathways (Path) might be part of other pathways (again, I'm limiting the depth). For both proteins and pathways, there are chains of variable length, and I want to catch all of them up to the maximum length.

My understanding is that this is the Gremlin equivalent (label names are changed to support multiple labels):

g.V().hasLabel ( 'Concept:Gene:Resource' )
  .out ( 'enc' ).hasLabel ( 'Concept:Protein:Resource' )

  .emit ()
  .repeat ( both ( 'h_s_s', 'ortho', 'xref' ).simplePath().hasLabel ( 'Concept:Protein:Resource' ) )
  .times ( 2 )

  .both ( 'is_a', 'ac_by' ).hasLabel ( 'Concept:Enzyme:Resource' )

  .out ( 'ac_by', 'in_by' ).hasLabel ( 'Comp:Concept:Resource' ) 
  .out ( 'cs_by', 'pd_by' ).hasLabel ( 'Concept:Resource:Transport' )

  .emit ()
  .repeat ( both ( 'part_of' ).simplePath().hasLabel ( 'Concept:Path:Resource' ) ) 
  .times ( 3 )

.sample ( 100 )
.path ().by ( 'iri' )

While this works, it's extremely slow (around 10-20 seconds). Is emit()/repeat()/times() the most efficient way to do it?

I might try unions with explicit paths of variable lengths, but that's not a very expressive and easy-to-write approach.
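For reference, the union-based alternative mentioned above could look roughly like the sketch below, shown for the protein segment only. It is untested against ArcadeDB: the label strings and edge names are copied from the query above, the trailing `limit(10)` is only there to keep the output small, and whether this runs any faster than repeat() depends on the engine.

```groovy
g.V().hasLabel('Concept:Gene:Resource')
  .out('enc').hasLabel('Concept:Protein:Resource')
  .union(
    // 0 hops: stay on the protein reached via 'enc'
    identity(),
    // 1 hop to another protein
    both('h_s_s', 'ortho', 'xref').hasLabel('Concept:Protein:Resource'),
    // 2 hops, with a protein at each intermediate step
    both('h_s_s', 'ortho', 'xref').hasLabel('Concept:Protein:Resource')
      .both('h_s_s', 'ortho', 'xref').hasLabel('Concept:Protein:Resource')
  )
  // drop any path that visits the same vertex twice
  .simplePath()
  .both('is_a', 'ac_by').hasLabel('Concept:Enzyme:Resource')
  .limit(10)
  .path().by('iri')
```

The same pattern would have to be repeated for the part_of segment with four branches (0 to 3 hops), which is where the verbosity becomes noticeable.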

Which database are you using? The performance depends entirely on the database's implementation of the repeat step. — PrashantUpadhyay, Aug 29 '23 at 00:38
@PrashantUpadhyay ArcadeDB, but from .explain() and .profile(), I suspect it wouldn't be much different in other implementations. If there is a good way to write this query, likely, the diff across DBs shouldn't be tens of seconds. — zakmck, Aug 29 '23 at 10:17
One comment: note that in Gremlin `times(2)` does not have the same meaning as `0..2` (at most two) in Cypher. The equivalent in Gremlin would be to use something like `until(has().or().loops().is(2)).has()`. You might also want to use `optional` to account for the `0` in `0..2`. To the other comment, the database used can have a very significant impact on the performance of any query. Not all Gremlin implementations simply use the open source TinkerPop classes to execute a query. — Kelvin Lawrence, Aug 29 '23 at 16:44
Thanks, @KelvinLawrence. That's strange, cause the until() version wasn't yielding the expected results (Cypher gives more), while this one, which uses emit() + times(), does. And I took it from your book, the case described at https://www.kelvinlawrence.net/book/PracticalGremlin.html#emit seems to be the same. — zakmck, Aug 29 '23 at 20:54
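
One way to check the times(2) versus 0..2 point raised in the comments is to compare the two forms on a throwaway graph. The sketch below assumes a Gremlin Console with TinkerGraph available; the four-vertex chain, the `P` label and the `iri` values are made up for illustration and are not part of the original data.

```groovy
// Toy chain p0 -xref-> p1 -xref-> p2 -xref-> p3 (hypothetical data)
g = TinkerGraph.open().traversal()
g.addV('P').property('iri', 'p0').as('a').
  addV('P').property('iri', 'p1').as('b').
  addV('P').property('iri', 'p2').as('c').
  addV('P').property('iri', 'p3').as('d').
  addE('xref').from('a').to('b').
  addE('xref').from('b').to('c').
  addE('xref').from('c').to('d').iterate()

// emit() placed before repeat() also emits the start vertex (0 hops),
// then every vertex reached after 1 and 2 iterations.
// Expected: three paths, covering 0, 1 and 2 hops; p3 is never reached.
g.V().has('iri', 'p0').
  emit().repeat(out('xref').simplePath()).times(2).
  path().by('iri')

// Without emit(), only traversers that completed both iterations survive.
// Expected: a single path of exactly 2 hops.
g.V().has('iri', 'p0').
  repeat(out('xref').simplePath()).times(2).
  path().by('iri')
```

On this toy data the emit()-before-repeat() form behaves like Cypher's `*0..2`, while the bare `times(2)` form behaves like `*2`, which may explain why the two commenters appear to disagree.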
