I am trying to extract a random sample of nodes from Neo4j using Gremlin. After searching around, I could not find an appropriate way to do it.
I use Neo4j via the REST API.
My ideal query would be something like this:
resultset.sample(50)
Obviously, there is no such method. Searching around, I found .random(), which emits each incoming element with a fixed probability rather than returning a fixed number of them. I thought about doing something like this:
ratio = (50 / resultset.count()) * 1.25
resultset.random(ratio)
The goal was to get a random set of roughly the desired size, padded with a few extra results. From the calling script, I would have shuffled it and selected the first 50. However, this does not work either, because resultset is empty after counting: the pipe is an iterator, and count() exhausts it.
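To make the failure mode concrete: a Gremlin pipe behaves like a plain iterator, so counting it walks it to the end and a second pass sees nothing. A minimal Java sketch of the same behaviour (the names here are illustrative, not Gremlin API):

```java
import java.util.Arrays;
import java.util.Iterator;

public class ExhaustedIterator {
    // Count the elements of an iterator by consuming it.
    static long drain(Iterator<?> it) {
        long n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterator<String> resultset = Arrays.asList("a", "b", "c").iterator();

        // Counting consumes the iterator...
        System.out.println(drain(resultset)); // prints 3

        // ...so nothing is left for the filtering pass.
        System.out.println(resultset.hasNext()); // prints false
    }
}
```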
I also considered getting a fixed ratio and getting a subset, but without the shuffle the last nodes have a lesser chance of being taken and I want to avoid sending over more data than needed.
I could also populate the resultset twice, once to count and once to filter, but that does not seem right.
What would be a good way to obtain a random sample?
Edit (based on comments by Marko A. Rodriguez):
I came up with the following:
nodes = ... some expression ...
candidates = nodes.toList()   // materialize the pipe once
Collections.shuffle(candidates)
size = 50
if (candidates.size() >= size) {
    return candidates[0..(size - 1)]
} else {
    return candidates
}
I find the final conditional a little annoying, but the slice throws an exception when the list has fewer than size entries.
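For what it's worth, the conditional can be avoided by clamping the upper bound of the slice. A sketch in plain Java (the helper name is mine, not part of any API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SampleList {
    // Return up to `size` randomly chosen elements from `candidates`.
    static <T> List<T> sample(List<T> candidates, int size) {
        List<T> copy = new ArrayList<>(candidates);
        Collections.shuffle(copy);
        // Clamp the bound so shorter lists are returned whole,
        // instead of branching on the list length.
        return copy.subList(0, Math.min(size, copy.size()));
    }

    public static void main(String[] args) {
        List<Integer> nodes = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(sample(nodes, 3).size());  // prints 3
        System.out.println(sample(nodes, 50).size()); // prints 5
    }
}
```

Recent Groovy versions (1.8.1+, if I recall correctly) also offer candidates.take(size), which likewise returns at most size elements without throwing.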
Also, does toList() have a performance impact on larger Neo4j datasets? As far as network traffic goes, the approach is optimal, since only the sampled nodes are sent back to the client.