When using Cassandra's recommended RandomPartitioner (or Murmur3Partitioner), it is not possible to do meaningful range queries on keys, because the rows are distributed around the cluster using a hash of the key (MD5 for RandomPartitioner, MurmurHash3 for Murmur3Partitioner). These hashes are called "tokens."
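To make the point concrete, here is a small sketch of how hashing scrambles key order. It is only an illustration: `md5_token` is a hypothetical helper that reads the MD5 digest as an unsigned integer, and the keys are hashed as ASCII strings. Cassandra serializes keys and derives RandomPartitioner tokens somewhat differently, so the actual token values will not match.

```python
from __future__ import print_function
import hashlib


def md5_token(key_bytes):
    """Hypothetical helper: read the MD5 digest of a key as an unsigned
    128-bit integer. Not Cassandra's exact token computation."""
    return int(hashlib.md5(key_bytes).hexdigest(), 16)


# Consecutive keys, hashed as ASCII strings (an assumption for illustration)
keys = [b"0", b"1", b"2", b"3"]
tokens = [md5_token(k) for k in keys]

# Print the keys in token order; it generally differs from key order,
# so a range of tokens does not correspond to a range of keys.
for key, token in sorted(zip(keys, tokens), key=lambda pair: pair[1]):
    print(key, token)
```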
Nonetheless, it would be very useful to split up a large table among many compute workers by assigning each a range of tokens. With CQL3, it appears possible to issue queries directly against the tokens; however, the following Python does not work... EDIT: it works after switching to testing against the latest version of the Cassandra database (doh!) and updating the syntax per the notes below:
## use python cql module
import cql
## If running against an old version of Cassandra, this raises:
## TApplicationException: Invalid method name: 'set_cql_version'
conn = cql.connect('localhost', cql_version='3.0.2')
cursor = conn.cursor()
try:
    ## remove the previous attempt to make this work
    cursor.execute('DROP KEYSPACE test;')
except Exception, exc:
    print exc
## make a keyspace and a simple table
cursor.execute("CREATE KEYSPACE test WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor = 1;")
cursor.execute("USE test;")
cursor.execute('CREATE TABLE data (k int PRIMARY KEY, v varchar);')
## put some data in the table -- must use single quotes around literals, not double quotes
cursor.execute("INSERT INTO data (k, v) VALUES (0, 'a');")
cursor.execute("INSERT INTO data (k, v) VALUES (1, 'b');")
cursor.execute("INSERT INTO data (k, v) VALUES (2, 'c');")
cursor.execute("INSERT INTO data (k, v) VALUES (3, 'd');")
## split up the full range of tokens (0 to 2**127 for RandomPartitioner).
## Suppose there are 2**k workers:
k = 3 # --> eight workers
token_sub_range = 2**(127 - k)
worker_num = 2 # for example
start_token = worker_num * token_sub_range
end_token = (1 + worker_num) * token_sub_range
## put single quotes around the token strings
cql3_command = "SELECT k, v FROM data WHERE token(k) >= '%d' AND token(k) < '%d';" % (start_token, end_token)
print cql3_command
## before the fixes described in the EDIT above, this failed with
## "ProgrammingError: Bad Request: line 1:28 no viable alternative at input 'token'"
cursor.execute(cql3_command)
for row in cursor:
    print row
cursor.close()
conn.close()
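The token arithmetic above can be generalized to any number of workers and to either partitioner. The following is a minimal sketch, not a tested recipe: `split_token_range` is a hypothetical helper, and the bounds are the documented token ranges (0 to 2**127 - 1 for RandomPartitioner, -2**63 to 2**63 - 1 for Murmur3Partitioner). It returns inclusive ranges, so the upper bound in the query would use `<=` rather than `<`.

```python
from __future__ import print_function

# Documented token bounds (inclusive) for the two partitioners
RANDOM_PARTITIONER_BOUNDS = (0, 2**127 - 1)
MURMUR3_PARTITIONER_BOUNDS = (-2**63, 2**63 - 1)


def split_token_range(num_workers, min_token, max_token):
    """Hypothetical helper: split [min_token, max_token] into num_workers
    contiguous, non-overlapping (start, end) ranges, both ends inclusive."""
    total = max_token - min_token + 1
    bounds = [min_token + (total * i) // num_workers
              for i in range(num_workers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_workers)]


# Eight workers over the RandomPartitioner range, as in the code above
ranges = split_token_range(8, *RANDOM_PARTITIONER_BOUNDS)
start_token, end_token = ranges[2]  # worker_num = 2
print(start_token, end_token)

# Four workers over the Murmur3Partitioner range
murmur_ranges = split_token_range(4, *MURMUR3_PARTITIONER_BOUNDS)
print(murmur_ranges[0], murmur_ranges[-1])
```

Inclusive ranges are used here so that the last Murmur3Partitioner range can end at 2**63 - 1 without needing a bound that falls outside the 64-bit token range.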
I would ideally like to make this work with pycassa, because I prefer its more pythonic interface.
Is there a better way to do this?