I am wanting to perform random samples from a large database, and I am wanting those samples to be paired, which means that I either I care about the order of results from a (series of) select statement(s) or reorder afterwards. Additionally, there may be duplicate rows as well. This is fine, but I want an efficient way to make these samples straight from the db. I understand that SELECT statements cannot be used with cursor.executemany but really that is what I would like.
There is a similar question here where the OP seems to be asking for a multi-select, but it happy with the current top answer which suggests using IN in the where clause. This is not what I am looking for really. I'd prefer something more like ken.ganong's solution, but wonder about the efficiency of this.
More precisely, I do something like the following:
import sqlite3
import numpy as np
# create the database and inject some values
values = [
(1, "Hannibal Smith", "Command"),
(2, "The Faceman", "Charm"),
(3, "Murdock", "Pilot"),
(4, "B.A. Baracas", "Muscle")]
con = sqlite3.connect('/tmp/test.db')
cur = con.cursor()
cur.execute(
'CREATE TABLE a_team (tid INTEGER PRIMARY KEY, name TEXT, role TEXT)')
con.commit()
cur.executemany('INSERT INTO a_team VALUES(?, ?, ?)', values)
con.commit()
# now let's say that I have these pairs of values I want to select role's for
tid_pairs = np.array([(1,2), (1,3), (2,1), (4,3), (3,4), (4,3)])
# what I currently do is run multiple selects, insert into a running
# list and then numpy.array and reshape the result
out_roles = []
select_query = "SELECT role FROM a_team WHERE tid = ?"
for tid in tid_pairs.flatten():
cur.execute(select_query, (tid,))
out_roles.append(cur.fetchall()[0][0])
#
role_pairs = np.array(out_roles).reshape(tid_pairs.shape)
To me it seems like there must be a more efficient way of passing a SELECT statement to the db which requests multiple rows each with their own constrants, but as I say executemany cannot be used with a SELECT statement. The alternative is to use an IN constraint in the WHERE clause then make the duplicates within python.
There are a few extra constraints, for instance, I may have non-existing rows in the db and I may want to handle that by dropping an output pair, or replacing with a default value, but these things are a side issue.
Thanks in advance.