I am trying to extract a random sample of nodes from Neo4j using Gremlin. After searching around, I could not find an appropriate way to do it.

I use Neo4j via the REST API.

My ideal query would be something like this:

resultset.sample(50)

Obviously, there is no such method. Searching around, I found .random() which would emit only random nodes. I thought about doing something like this:

ratio = (50 / resultset.count()) * 1.25
resultset.random(ratio)

The goal was to get a random set slightly larger than the size I need. The calling script would then shuffle it and select the first 50. However, this does not work either, because resultset is empty after counting: count() consumes the pipeline.

I also considered sampling with a fixed ratio and taking a subset of the result, but without the shuffle the last nodes have a lower chance of being taken, and I want to avoid sending over more data than needed.

I could also populate the resultset twice, once to count and once to filter. However, that does not seem right.

What would be a good way to obtain a random sample?

Edit (based on comments by Marko A. Rodriguez):

I came up with the following:

nodes = ... some expression ...
candidates = nodes.toList()
Collections.shuffle(candidates)

size = 50
// size() is the element count; count() with no argument does not return the list length
if (candidates.size() >= size) {
    return candidates[0..(size-1)];
} else {
    return candidates;
}

I find the last condition a little annoying, but the slicing fails if the list has fewer entries than the requested size.

Also, does this have an impact on larger datasets for Neo4j? As far as network communications go, it is optimal.

Louis-Philippe Huberdeau

1 Answer

Given that you need a specific count, you could generate a list and then sample that list. For example:

MyHelper.getRandomSampleFromList(my.particular.traversal.toList())

Because you don't know in advance how many results your traversal will return, you can't get a predetermined sample size from the traversal itself. Your MyHelper.getRandomSampleFromList(List list) will look something like the approach discussed in this question:

Take n random elements from a List<E>?
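
For reference, here is a minimal sketch of what such a helper could look like in Java. MyHelper is only the placeholder name used above, not part of Gremlin or Neo4j, and the extra n parameter is an assumption added for illustration. It shuffles a copy of the list and keeps at most n elements, so no special case is needed when the list is shorter than n:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the class name comes from the answer above,
// the second parameter (n) is an assumption made for this sketch.
final class MyHelper {

    private MyHelper() {
    }

    // Returns up to n elements of the given list, chosen uniformly at random.
    // The input list is left untouched because the shuffle runs on a copy.
    static <E> List<E> getRandomSampleFromList(List<E> list, int n) {
        List<E> copy = new ArrayList<E>(list);
        Collections.shuffle(copy);
        // Clamp the end index so short lists (or a negative n) do not throw.
        int end = Math.max(0, Math.min(n, copy.size()));
        return new ArrayList<E>(copy.subList(0, end));
    }
}
```

From a Gremlin-Groovy script this would presumably be called as MyHelper.getRandomSampleFromList(nodes.toList(), 50), assuming the class is on the server's classpath. Note that the whole result is still materialized in memory on the server; only the sample travels over REST.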

Marko A. Rodriguez
  • This is what I have been trying to do. The other question you pointed to does mention Collections.shuffle(), so if I can manage to use it, it would partially solve the issue. I can just sample a much larger ratio than I need and still not send everything over REST. I will get back to you on this. – Louis-Philippe Huberdeau Feb 21 '12 at 23:36
  • I updated the question in order to leave a more complete answer for future reference. Any comments? Your help is much appreciated. – Louis-Philippe Huberdeau Feb 22 '12 at 15:17