Loading data fast with Cassandra

Question

Right now I'm running an EC2 cluster (m3.xlarge) and am getting around 2700rows/sec loading into Cassandra. I came across this article, Cassandra: Load large data fast, but it seems to be a little outdated and doesn't describe how to load CSVs that have mapped data.

Can you load mapped data with sstableloader? Also, if I increase the specs on my ec2 instance (more ram, cpu, iops), would that increase the load speed in cql?

yurgis · Answer 1 · 2015-09-02T07:56:59.700

If you want to isolate your performance issue, it is always a good idea to start from something that works. Try running this simple test (it assumes you are running Cassandra on localhost, port 9042):

  @Test
  public void testThroughput() throws Exception {
    Cluster cluster = Cluster.builder()
        .addContactPoint("localhost")
        .withProtocolVersion(ProtocolVersion.V2)
        .withPort(9042)
        .build();
    Session session = cluster.connect();
    session.execute("CREATE KEYSPACE IF NOT EXISTS test" + 
                      " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}");
    session.execute("USE test");
    session.execute("CREATE TABLE IF NOT EXISTS parent_children (" +
                      " parentId uuid," +
                      " childId uuid," + 
                      " PRIMARY KEY (parentId, childId))");
    UUID parent = UUID.randomUUID();
    long beforeInsert = System.currentTimeMillis();
    List<ResultSetFuture> futures = new ArrayList<>();
    int n = 1000000;
    for (int i = 0; i < n; i++) {
      UUID child = UUID.randomUUID();
      futures.add(session.executeAsync("INSERT INTO parent_children (parentId, childId) VALUES (?, ?)", parent, child));
      if (i % 10000 == 0) {
        System.out.println("Inserted " + i + " of " + n + " items (" + (100 * i / n) + "%)");
      }
    }
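    //to be honest with ourselves let's wait for all to finish and succeed
    //(successfulAsList keeps one slot per future and puts null where a write failed)
    List<ResultSet> succeeded = Futures.successfulAsList(futures).get();
    Assert.assertEquals(n, succeeded.size());
    Assert.assertFalse("some inserts failed", succeeded.contains(null));
    long endInsert = System.currentTimeMillis();
    //1000L keeps the multiplication in long arithmetic if n is increased
    System.out.println("Time to insert: " + (endInsert-beforeInsert) + "ms; " + 1000L * n/(endInsert-beforeInsert) +  " per second");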
    cluster.close();
  }

It automatically creates a "test" keyspace with a single parent/child table and inserts 1M rows into the same partition using executeAsync. (You can easily modify it to insert into multiple partitions if you want.)
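
A minimal sketch of the multi-partition variant, kept free of driver calls so it runs on its own. `PartitionPool` and `partitionCount` are hypothetical names that do not appear in the test above; the idea is simply to pre-generate a handful of parent UUIDs and hand them out round-robin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical helper (not part of the driver): pre-generates partition keys
// and assigns row i to partition i mod partitionCount.
final class PartitionPool {
    private final List<UUID> parents = new ArrayList<>();

    PartitionPool(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            parents.add(UUID.randomUUID());
        }
    }

    // Round-robin choice of the partition key for the i-th row.
    UUID parentFor(int i) {
        return parents.get(Math.floorMod(i, parents.size()));
    }
}
```

In the loop of the test, `parent` would then be replaced by `pool.parentFor(i)`, with `pool` created once before the loop, for example `new PartitionPool(100)`.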

What number are you getting? On my Mac Pro laptop, I am getting 25k per second. I am pretty sure this would scale linearly with the number of Cassandra nodes, but only if you insert into multiple partitions (eventually you may need to increase the number of concurrent clients as well).

This is a self-contained Java JUnit test case. You can put it in main() or run it as a unit test. You need the DataStax Cassandra driver library on the classpath, of course. — yurgis, Sep 02 '15 at 18:13
@yurgis thanks for this example, but how can I get the records that failed to be inserted into Cassandra for any reason? I checked many examples like [this](https://www.datastax.com/dev/blog/java-driver-async-queries) but none of them works as intended. When we get to `List succeeded = Futures.successfulAsList(futures).get()`, all the errors propagate, and the only way I found to get them is by catching exceptions like `ExecutionException`. How can I get the errors by using a callback or by calling `.get()` on the list of futures? — Sobhan Atar, Jun 14 '19 at 16:00
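
One possible way to get at the failed rows is to keep each row next to its future and inspect the futures one by one, since Guava's `Futures.successfulAsList` only leaves a null in place of a failed result and drops the cause. The sketch below uses the JDK's `CompletableFuture` as a stand-in for the driver's `ResultSetFuture` so that it runs without a cluster; `FailedWrites` and `collectFailures` are hypothetical names, and with the real driver the per-future wait would presumably be a `getUninterruptibly()` call inside a try/catch instead of `join()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical helper: CompletableFuture stands in for the driver's future type.
final class FailedWrites {
    // Waits for every pending write and returns the rows that failed,
    // each mapped to the exception that caused the failure.
    static <R> Map<R, Throwable> collectFailures(Map<R, CompletableFuture<?>> pending) {
        Map<R, Throwable> failed = new LinkedHashMap<>();
        for (Map.Entry<R, CompletableFuture<?>> entry : pending.entrySet()) {
            try {
                entry.getValue().join();
            } catch (CompletionException ex) {
                // join() wraps the original failure; unwrap it for the caller.
                failed.put(entry.getKey(), ex.getCause());
            } catch (CancellationException ex) {
                failed.put(entry.getKey(), ex);
            }
        }
        return failed;
    }
}
```

The returned map can then drive a retry or be written to a reject file, which is not possible once the results have been flattened into a single list.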

score 0 · Answer 2 · answered Aug 26 '15 at 20:31

It depends a lot on your data model, that is, on what counts as a row in "2700rows/sec", but you should be able to get 10x-50x that many writes per second with just a simple application. There may be something about your application that makes it so slow. Are you using async writes?

In many cases it's faster to just write the data than to use the bulk loader options, but there are some examples at http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

// Prepare SSTable writer 
CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
// set output directory 
builder.inDirectory(outputDir)
       // set target schema 
       .forTable(SCHEMA)
       // set CQL statement to put data 
       .using(INSERT_STMT)
       // set partitioner if needed 
       // default is Murmur3Partitioner so set if you use different one. 
       .withPartitioner(new Murmur3Partitioner());
CQLSSTableWriter writer = builder.build();
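
If "mapped data" means CQL map columns, one approach is to parse each CSV field into a `java.util.Map` before handing the row to the writer. The sketch below is only that parsing step. It assumes, and this format is not taken from the question, that a map field looks like `k1:v1;k2:v2`; `MapField` and `parse` are hypothetical names. The resulting map should then be usable as the value for the map column when adding a row to the writer, though that is worth checking against the Cassandra version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns an assumed "k1:v1;k2:v2" CSV field into a Map.
final class MapField {
    static Map<String, String> parse(String field) {
        Map<String, String> result = new LinkedHashMap<>();
        if (field == null || field.trim().isEmpty()) {
            return result; // an empty field becomes an empty map
        }
        for (String pair : field.split(";")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in map entry: " + pair);
            }
            // Everything before the first ':' is the key, the rest is the value.
            result.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return result;
    }
}
```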

Another option is to use the COPY command in cqlsh, but I am not sure how performant it is: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/copy_r.html

cqlsh> COPY music.imported_songs from 'songs-20140603.csv';

I would first try to optimize your client though since 2700w/s is obscenely slow.

score 0 · Answer 3 · edited May 23 '17 at 10:26

2700rows/s isn't that slow - it depends on your data model. I've reached 3k to 5k rows/s on 3x m1.large with my schema. Check your cluster utilization, as I did in: Cassandra write benchmark, low (20%) CPU usage

You can also check whether cassandra-stress reaches the same number of rows/s for your data model.

Of course, try the COPY command mentioned by @ChrisLohfink.

Getting back to your questions:

Can you load mapped data with sstableloader?

See: How do you use the Cassandra tool sstableloader?

Also, if I increase the specs on my ec2 instance (more ram, cpu, iops), would that increase the load speed in cql?

Certainly. Figure out what's limiting you (check my question) and pick a better machine that removes this limit.
