7

I'm using Amazon Neptune to create and query a simple graph database. I'm currently running my code in an AWS Jupyter Notebook but will eventually move the code to Python (gremlin_python). As you can probably guess I'm pretty new to Gremlin and graph databases in general.

I have the following data

g.addV('person').property(id, 'john')
 .addV('person').property(id, 'jim')
 .addV('person').property(id, 'pam')
 .addV('game').property(id, 'G1')
 .addV('game').property(id, 'G2')
 .addV('game').property(id, 'G3').iterate() 

g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()

g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()

g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()

3 persons and 3 games in the database. My goal is, given a person, tell me which persons have bought the same games as them and which games are those

After looking at sample code (mostly from https://tinkerpop.apache.org/docs/current/recipes/#recommendation) I have the following code that tries to find games bought by

g.V('john').as('target')   Target person we are interested in comparing against
.out('bought').aggregate('target_games') // Games bought by target
.in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
.group().by().by(out("bought").where(P.within("target_games")).count()) // Find persons, group by number of co owned games
.unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games

Which gives me the results:

  • {v[jim]: 2}
  • {v[pam]: 1}

Which tells me that jim has 2 of the same games as john while pam only has 1. But I want my query to return the actual games they have in common like so (still in order of most common games):

  • {v[jim]: ['G1', 'G2']}
  • {v[pam]: ['G1]}

Thanks for your help.

Kelvin Lawrence
  • 14,674
  • 2
  • 16
  • 38
G_E
  • 189
  • 1
  • 10
  • 2
    I provided a few different ways to do this in the answer below. You might want to change the question title to something like "Find people that own the games I own". The current title will not help other people searching on this same kind of issue. Thanks for providing the sample graph. That really helps when answering questions. – Kelvin Lawrence Feb 21 '21 at 14:48
  • 1
    Everything has been answered below by @KelvinLawrence except how to order the final results by number of common games. – G_E Feb 23 '21 at 08:44
  • 1
    I added a version of the query that sorts by number of games in common – Kelvin Lawrence Feb 25 '21 at 00:18

1 Answers1

6

There are a few different ways this query could be written. Here is one way that uses a mid traversal V step having found John's games to find all the other people who are not John, look at their games and see if they intersect with games that John owns.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   V().
......4>   hasLabel('person').
......5>   where(neq('j')).
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]       

However, the mid traversal V approach is not really needed as you can just look at the incoming vertices from the games that Jown owns

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   group().
......6>     by(id).
......7>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]         
                   

Finally, here is a third way, where the dedup step is applied sooner. This is likely to be the most efficient of the three.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   dedup().
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]    

UPDATED based on comments discussion. I'm not sure that this is a simpler query but you can extract a group from a projection like this:

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   project('johnGames','personPurchasedJohnGames').
......5>     by(select('johnGames')).
......6>     by(select('personPurchasedJohnGames')).
......7>   group().
......8>     by(select('personPurchasedJohnGames')).
......9>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]      

but actually you can further reduce this to

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   group().
......5>     by(select('personPurchasedJohnGames')).
......6>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]        

So now we have many choices to pick from! It will be interesting to measure these and see if any are faster than others. In general I have a tendency to avoid use of as steps as that causes path tracking to be turned on (using up memory) but as we already have an as('j') in the other queries not really a big deal.

EDITED AGAIN to add ordering of results

g.V('john').as('j').
   out().as('johnGames').
   in('bought').
   where(neq('j')).as('personPurchasedJohnGames').
   group().
     by(select('personPurchasedJohnGames')).
     by(select('johnGames').fold()).
   unfold().
   order().
    by(select(values).count(local),desc)

{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}
Kelvin Lawrence
  • 14,674
  • 2
  • 16
  • 38
  • 1
    Awesome! this is really helpfull. I have been trying this out myself, couldn't get it working mainly because i was approaching in different way :( . All these three solutions are selecting the games of people how have purchased John games and exclude John games from the list. is it possible to do this without traversing all the games of each individual? we are starting at John , going to his games(collect them), get all other people who purchased these games and stop there, because we have everything we need, and simply group the games and people? – Balu Vyamajala Feb 21 '21 at 15:14
  • 1
    let say `g.V('john').as('j').out().as('johnGames').in().where(neq('j')).as('personPurchasedJohnGames').project('johnGames','personPurchasedJohnGames').by(select('johnGames')).by(select('personPurchasedJohnGames'))` , but just not able to group by person – Balu Vyamajala Feb 21 '21 at 15:23
  • 1
    I added an example of how to group the results of the project and also how to remove it. – Kelvin Lawrence Feb 21 '21 at 19:43
  • 1
    Great!! Last one is what i have ben trying to get to. i didn't use .fold() in `.by(select('personPurchasedJohnGames')).by(select('johnGames').fold())` and have been getting just 1 game. still learning only a month into Graph database. Appreciate the answers with multiple examples. I would double upvote if i could :) – Balu Vyamajala Feb 21 '21 at 19:53
  • 2
    Glad it helped. Please accept the answer so others can see it is answered. – Kelvin Lawrence Feb 21 '21 at 20:07
  • Thank you so much for your help. I have tried a few of the queries and they have all worked. The only thing missing for me is how to order the results by number of common games. – G_E Feb 22 '21 at 08:54
  • Just to clarify - you want to order the results basically by the count of common games so the ones with the most come first etc? – Kelvin Lawrence Feb 24 '21 at 21:53
  • I added a version of the query that sorts by number of games in common – Kelvin Lawrence Feb 25 '21 at 00:17