
Probably a beginner question with Python.

I am able to iterate over the results of an Aerospike db query like this -

import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}  # cluster connection config
client = aerospike.client(config).connect()

def process_result(record_tuple):
    # each scan result is a (key, metadata, record) tuple
    key, metadata, record = record_tuple
    expiresIn = record.get("expiresIn")

scan = client.scan('namespace', 'setName')

scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')  # scan from aerospike

scan.foreach(process_result)

Now, all I want to do is get the nth record from this set, without having to iterate through all of them.

I tried looking at Get the nth item of a generator in Python but could not make much sense of it.

Sandeepan Nath

2 Answers


Results from a scan operation come from all the nodes in the cluster, pipelined, in no particular order. In that sense, there is no difference between the first record and the Nth record in terms of ordering. There is no order.

I wrote some Medium posts on how to sort results from a scan or query.

Ronen Botzer
pgupta
  • I just need to randomly pick some 10 records from the result set of size n, grab the primary key, and then compare the same record in another database that is expected to have the same data. We are basically doing a migration from Aerospike to another db and this is part of the sanity checks. – Sandeepan Nath Jul 29 '18 at 19:51
  • You can certainly kill the scan callback after getting 10 records (see the sketch after these comments). Once you stop consuming scan output in the client, the scan job on the server will eventually be killed - I think it's 10 seconds or so of the pipeline being full before it declares the job abandoned and kills it. – pgupta Jul 29 '18 at 20:16
  • So you mean every time I iterate through the output of the scan call, I am already going through a random list? So if I call the scan job again, using the same script, after some time, I will get a new order? Does calling close() ensure the same, without having to wait for that time? – Sandeepan Nath Jul 29 '18 at 20:21
  • Correcting my understanding - every time I call scan() after that time interval? – Sandeepan Nath Jul 29 '18 at 20:24
  • Let's say the list is not guaranteed to be repeatable in order. Further, if the cluster underneath undergoes a rebalancing event during an active scan job, you may get duplicate or missed records unless you set failOnClusterChange to true. – pgupta Jul 29 '18 at 21:42
  • Scan supports a percentage. If you are looking for a random sample, you can request 1% of the data (you cannot request 0.1%). – sunil Jul 30 '18 at 01:31
  • Hmm, cool. ScanPolicy in Java: public int scanPercent - percent of data to scan. Valid integer range is 1 to 100. Default is 100. – pgupta Jul 30 '18 at 01:44
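
Following up on the comments, here is a minimal sketch of stopping the scan early: the Python client's foreach callback ends the stream when it returns False, so you can collect 10 records and bail out. The config/hosts, set names and sample size below are placeholder assumptions, and the 1% sampling mentioned above is a scan policy/option in some client versions.

import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}  # placeholder seed node
client = aerospike.client(config).connect()

sampled = []

def collect_sample(record_tuple):
    # each result is a (key, metadata, record) tuple
    key, metadata, record = record_tuple
    sampled.append((key, record))
    # returning False from the callback stops the scan stream
    return len(sampled) < 10

scan = client.scan('namespace', 'setName')
scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')
scan.foreach(collect_sample)

print("sampled %d records" % len(sampled))
client.close()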

As usual, the workaround would be to set the scan policy to return just the digests, store them as a list (or several records with smaller lists) and paginate over those with batch reads. You can set reasonable TTLs so that this result set lives for a reasonable length of time.

I can provide sample code if needed.
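
A rough, untested sketch of that approach (collect digests with a no-bins scan, then page over them with get_many batch reads) could look like this; the nobins option, namespace/set names, page size and the TTL/storage handling are assumptions:

import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}  # placeholder seed node
client = aerospike.client(config).connect()

digests = []

def collect_digest(record_tuple):
    # the key is a (namespace, set, primary_key, digest) tuple; keep a batch-read key
    (ns, set_name, pk, digest), metadata, bins = record_tuple
    digests.append((ns, set_name, None, digest))

scan = client.scan('namespace', 'setName')
# ask the scan to skip bin data and return only keys/metadata
# (option name may vary across client versions)
scan.foreach(collect_digest, {}, {'nobins': True})

# paginate over the stored digests with batch reads
PAGE_SIZE = 100
for start in range(0, len(digests), PAGE_SIZE):
    page = client.get_many(digests[start:start + PAGE_SIZE])
    for key, meta, bins in page:
        if bins is not None:
            expiresIn = bins.get('expiresIn')
            # ... compare with the other database here ...

client.close()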

Ronen Botzer