0

I am wanting to perform random samples from a large database, and I am wanting those samples to be paired, which means that I either I care about the order of results from a (series of) select statement(s) or reorder afterwards. Additionally, there may be duplicate rows as well. This is fine, but I want an efficient way to make these samples straight from the db. I understand that SELECT statements cannot be used with cursor.executemany but really that is what I would like.

There is a similar question here where the OP seems to be asking for a multi-select, but it happy with the current top answer which suggests using IN in the where clause. This is not what I am looking for really. I'd prefer something more like ken.ganong's solution, but wonder about the efficiency of this.

More precisely, I do something like the following:

import sqlite3
import numpy as np
# create the database and inject some values
values = [
    (1, "Hannibal Smith", "Command"),
    (2, "The Faceman", "Charm"),
    (3, "Murdock", "Pilot"),
    (4, "B.A. Baracas", "Muscle")]
con = sqlite3.connect('/tmp/test.db')
cur = con.cursor()
cur.execute(
    'CREATE TABLE a_team (tid INTEGER PRIMARY KEY, name TEXT, role TEXT)')
con.commit()
cur.executemany('INSERT INTO a_team VALUES(?, ?, ?)', values)
con.commit()
# now let's say that I have these pairs of values I want to select role's for
tid_pairs = np.array([(1,2), (1,3), (2,1), (4,3), (3,4), (4,3)])
# what I currently do is run multiple selects, insert into a running
# list and then numpy.array and reshape the result
out_roles = []
select_query = "SELECT role FROM a_team WHERE tid = ?"
for tid in tid_pairs.flatten():
    cur.execute(select_query, (tid,))
    out_roles.append(cur.fetchall()[0][0])
#
role_pairs = np.array(out_roles).reshape(tid_pairs.shape)

To me it seems like there must be a more efficient way of passing a SELECT statement to the db which requests multiple rows each with their own constrants, but as I say executemany cannot be used with a SELECT statement. The alternative is to use an IN constraint in the WHERE clause then make the duplicates within python.

There are a few extra constraints, for instance, I may have non-existing rows in the db and I may want to handle that by dropping an output pair, or replacing with a default value, but these things are a side issue.

Thanks in advance.

Community
  • 1
  • 1
pandamonium
  • 109
  • 7
  • `DISTINCT` or `GROUP BY` clauses can't help you together with `IN`? – Mauro Baraldi Aug 07 '14 at 14:07
  • I am not sure how you mean to use either `DISTINCT` or `GROUP BY`. `tid` is already a `PRIMARY KEY` so these wouldn't be duplicated in a single `SELECT` with a `WHERE tid IN...` clause, and anyway I want to preserve duplication and input ordering. – pandamonium Aug 12 '14 at 11:08

0 Answers0