In TinkerPop 3, how do I perform pagination? I want to fetch the first 10 elements of a query, then the next 10, without loading them all into memory. For example, the query below returns 1,000,000 records. I want to fetch them 10 at a time without loading all 1,000,000 at once.

g.V().has("key", value).limit(10)

Edit

A solution that works through HttpChannelizer on Gremlin Server would be ideal.

Mohamed Taher Alrefaie

2 Answers

From a functional perspective, a nice looking bit of Gremlin for paging would be:

gremlin> g.V().hasLabel('person').fold().as('persons','count').
               select('persons','count').
                 by(range(local, 0, 2)).
                 by(count(local))
==>[persons:[v[1],v[2]],count:4]
gremlin> g.V().hasLabel('person').fold().as('persons','count').
               select('persons','count').
                 by(range(local, 2, 4)).
                 by(count(local))
==>[persons:[v[4],v[6]],count:4]

In this way you get the total count of vertices with the result. Unfortunately, the fold() forces you to count all the vertices which will require iterating them all (i.e. bringing them all into memory).

There really is no way to avoid iterating all 1,000,000 vertices in this case as long as you intend to execute your traversal in multiple separate attempts. For example:

gremlin> g.V().hasLabel('person').range(0,2)
==>v[1]
==>v[2]
gremlin> g.V().hasLabel('person').range(2,4)
==>v[4]
==>v[6]

The first statement is the same as if you'd terminated the traversal with limit(2). The second traversal only wants the second pair of vertices, but it is not as though you magically skip iterating the first two, because it is a new traversal. I'm not aware of any TinkerPop graph database implementation that will do that efficiently; they all have that behavior.

The only way to do ten vertices at a time without having them all in memory is to use the same Traversal instance as in:

gremlin> t = g.V().hasLabel('person');[]
gremlin> t.next(2)
==>v[1]
==>v[2]
gremlin> t.next(2)
==>v[4]
==>v[6]

With that model you only iterate the vertices once and don't bring them all into memory at a single point in time.
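To make the cost difference concrete, here is a minimal sketch in plain Python that only simulates the two approaches. It does not use any TinkerPop API: `CountingSource`, `page_by_range`, and `page_by_next` are hypothetical names, and the generator stands in for a graph scan so that we can count how many elements get iterated.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stand-in for a graph scan; counts every element it yields."""

    def __init__(self, size):
        self.size = size
        self.touched = 0

    def scan(self):
        # Each call starts a brand new iteration, like a new traversal.
        for i in range(self.size):
            self.touched += 1
            yield i


def page_by_range(source, start, end):
    # Mimics range(start, end) on a fresh traversal: the skipped
    # elements are still iterated before the page is returned.
    return list(islice(source.scan(), start, end))


def page_by_next(iterator, n):
    # Mimics t.next(n) on one shared traversal: continues where it left off.
    return list(islice(iterator, n))


# Two pages via separate "traversals".
range_source = CountingSource(1000)
range_pages = [page_by_range(range_source, 0, 10),
               page_by_range(range_source, 10, 20)]
range_cost = range_source.touched

# Two pages via a single shared iterator.
next_source = CountingSource(1000)
shared = next_source.scan()
next_pages = [page_by_next(shared, 10), page_by_next(shared, 10)]
next_cost = next_source.touched

print(range_cost, next_cost)
```

Both approaches return the same pages, but the range-style version iterates the skipped elements again on every page, so its cost grows with the page offset, while the shared iterator touches each element once.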

Some other thoughts on this topic can be found in this blog post.

stephen mallette
  • Thanks Stephen, the last solution sounds good. However, how would you do it if you're using an `HttpChannelizer`? – Mohamed Taher Alrefaie Oct 03 '16 at 13:47
  • If you are doing HTTP, you can't. You would have to switch to websockets and then you get that streaming for free, or if you wanted to control it more manually you could use a session. – stephen mallette Oct 03 '16 at 13:53
  • Is it possible in Python to save a traversal for multiple subsequent iterations with `.next()`? In Java (as I understand it, for JanusGraph) this can be done with a session: create the session with `Client client = cluster.connect("uniqueSessionName",true);`, define a traversal variable in one query (e.g. `t = g.V().hasLabel('person');`), and iterate over that traversal in another query (e.g. `t.next(2)`). But how can this be done with gremlin-python? – palandlom Dec 07 '18 at 08:06
  • there is no `connect()` option in python but you can construct a "session" request message yourself and send that via `Client.submit()`. the format for that message is [here](http://tinkerpop.apache.org/docs/3.4.0-SNAPSHOT/dev/provider/#_session_opprocessor) and isn't so different from the standard sessionless form of the message constructed in `submit()` shown [here](https://github.com/apache/tinkerpop/blob/1041f86d2e77d14dc214fdb917c73387987de546/gremlin-python/src/main/jython/gremlin_python/driver/client.py#L113) – stephen mallette Dec 07 '18 at 11:34
  • Thanks, Stephen. But when I try to set `processor='session'` in `message = request.RequestMessage(processor='session', op='eval' ... 'session': '4ad866cd-def9-4a76-86a1-f788785bb482', 'aliases': {'g': self._traversal_source}})` I get the error `Exception("Unknown processor")`. As I understand it, there is no such processor in [serializer.py](https://github.com/apache/tinkerpop/blob/1041f86d2e77d14dc214fdb917c73387987de546/gremlin-python/src/main/jython/gremlin_python/driver/serializer.py) – palandlom Dec 07 '18 at 14:12
  • hmm - sorry, i didn't know about that. i guess you'd need to add a processor to the serializer?? i suppose gremlin-python should be modified to support this natively. honestly, it would be better to try to avoid sessions though if you can. they aren't supported cleanly across all graphs so your code becomes less portable for using them. they also come with a bit of overhead. – stephen mallette Dec 07 '18 at 14:58

Why not add order().by() and apply range() to your Gremlin query?
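As a rough sketch of how that could look over HTTP, the Python below builds the JSON body for one page, using the `gremlin` and `bindings` fields that Gremlin Server's HTTP endpoint accepts. The function name `build_page_request`, ordering by `T.id`, and the binding names `start` and `end` are assumptions made for illustration, and this has not been run against a server. Note that each page is still a new traversal, so the server iterates the skipped elements every time, as the other answer explains.

```python
import json


def build_page_request(page, value, page_size=10):
    """Build the JSON body for one page (zero-based) of an ordered query.

    Hypothetical helper: the script text and binding names are assumptions.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size
    # A stable sort key keeps pages consistent between requests;
    # ordering by T.id is an assumption, any unique key would do.
    script = "g.V().has('key', value).order().by(T.id).range(start, end)"
    return json.dumps({
        "gremlin": script,
        "bindings": {"value": value, "start": start, "end": start + page_size},
    })


# Body for the third page of ten results.
body = build_page_request(2, "some-value")
print(body)
```

The resulting string would be sent as the body of a POST request to the server's HTTP endpoint. Passing the offsets as bindings rather than formatting them into the script keeps the script text identical across pages.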

4b0
jaypeeig