2

I am inserting and updating multiple entries into a table in Cassandra using python Cassandra driver. Currently my code looks like:

cluster = Cluster()
session = cluster.connect('db')
for a in list:
    if bool:
        # calculate b
        session.execute("UPDATE table SET col2 = %s WHERE col1 = %s", (b, a))
    else:
        # calculate b
        session.execute("INSERT INTO table(col1, col2) VALUES(%s, %s)", (a, b))

This method of insertion and update is quite slow as the number of entries in the list (all are unique) which are to be inserted is very large. Is there any faster way of doing this?

Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23
  • 1
    use `Session.execute_async` method with prepared statement – Azat Ibrakov Jun 05 '17 at 06:25
  • where `bool` comes from? it is a built-in class name, don't use it for your objects – Azat Ibrakov Jun 05 '17 at 06:32
  • @AzatIbrakov does execute_async() execute the queries in order? If it is so, then I can execute many queries using execute_async() and just call result() on the ResponseFuture object returned by the last call to session.execute_async(), right? If no, then what should I do to ensure all queries have been executed(i.e. all inserts have completed)? – Gurtej Sohi Jun 05 '17 at 09:11
  • what do you mean by "in order"? they are supposed to be asynchronous, why would you need order at all? – Azat Ibrakov Jun 05 '17 at 09:42

1 Answers1

0

Generally for this scenario, you will see the best performance by increasing the number of concurrent writes to Cassandra.

You can do this with the Datastax Python Cassandra driver using execute_concurrent

From your description, it is worth noting that for your case there is no difference between an Update and an Insert with Cassandra. (i.e. you can simply do the insert statement from your else clause for all values of (a, b).

You will want to create a prepared statement.

Rather than doing the inserts one at a time in your for-loop, consider pre-computing groups of (a,b) pairs as input for execute_concurrent; you can also write a generator or generator expression as input for execute_concurrent.

Example:

parameters = ((a, calculate_b(a)) for a in my_list)
execute_concurrent_with_args(my_session, my_prepared_statement, parameters)
Andrew Mo
  • 1,433
  • 9
  • 12