0

I am trying to move all my data from one column-family (table) to the other. Since both the tables have different descriptions, I would have to pull all data from table-1 and create a new object for table-2 and then do a bulk aync insert. My table-1 has millions of records so I cannot get all the data directly in my data structure and work that out. I am looking out for solutions to do that easily using Spring Data Cassandra with Java.

I initially planned for moving all the data to a temp table first followed by creating some composite key relations and then querying back my master table. However, it doesn't seems favorable to me. Can anyone suggest a good strategy to do this? Any leads would be appreciated. Thanks!

mp911de
  • 17,546
  • 2
  • 55
  • 95
Bhaskar
  • 337
  • 6
  • 21

2 Answers2

2

My table-1 has millions of records so I cannot get all the data directly in my data structure and work that out.

With datastax java driver you can get all data by token ranges and work out data from each token range. For example:

Set<TokenRange> tokenRanges = cassandraSession.getCluster().getMetadata().getTokenRanges();

for(TokenRange tr: tokenRanges) {
    List<Row> rows = new ArrayList<>();
    for(TokenRange sub: tr.unwrap()){
        String query = "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?";
        SimpleStatement st = new SimpleStatement( query, sub.getStart(), sub.getEnd() );
        rows.addAll( session.execute( st ).all() );
    }
    transformAndWriteToNewTable(rows); 
}

Each token range contains only piece of all data and can be handled by one physical machine. You can handle each token range independently (in parallel or asynchronously) to get more performance.

Mikhail Baksheev
  • 1,394
  • 11
  • 13
  • Thanks. I was working using the tokenRanges itself. However, I am bit confused how this will behave in case I have replication factor set on my keyspace. Say, I am trying to fetch records from a keyspace and it has replication factor as 3. That would mean, the same data would be residing in three hosts. Will they have the same hash? I guess no. How can I differentiate if its the duplicate/redundant/replicated data if I use token approach? – Bhaskar Apr 10 '17 at 12:51
  • 1
    The hash doesn't depend on replication factor, it depends on partitioner and partition key value only. When you query data with token function, it will not return duplicates, this works the same as querying by partition key. See about data distribution here http://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeAbout.html. – Mikhail Baksheev Apr 10 '17 at 13:19
  • The code above returns the same number of rows as regular query 'select count(*) from table' for keyspace with replicatin factor = 3 – Mikhail Baksheev Apr 10 '17 at 13:26
  • Thanks, in that case while using Token Range, I shall not worry about the replication factor as it will always give me a new row of data irrespective of replication factor, seeds I have? – Bhaskar Apr 10 '17 at 14:15
  • 1
    It will give one row for each partition key, but don't forget about evantually consistency, this data can be stale, set consistency level properly to always get latest data. – Mikhail Baksheev Apr 10 '17 at 15:27
  • Thanks Mikhail. I am currently reading more on usage of TokenRanges and its implementation. I am just worried about the fact that it should never ever give me a duplicated row if get the tokenranges using the host. Set allHosts = metadata.getAllHosts(); for (Host obj : allHosts) { Set tokenRanges = metadata.getTokenRange("keyspace",obj) } In this case, I am trying to find TokenRange for each host in the node. Does it make sense or should I get all the tokens from metadata irrespective of the number of hosts? – Bhaskar Apr 10 '17 at 15:35
  • getTokenRange("keyspace",obj) according to documentation: "Returns the token ranges that are replicated on the given host, for the given keyspace" and result will contains duplicates of token ranges. There is need more efforts to find only token ranges which owned by each node. You can look at Host.getTokens and try to map token ranges by Hosts – Mikhail Baksheev Apr 10 '17 at 16:35
  • Correct! So in this case, I suppose I should not look for token ranges using getTokenRange("keyspace",obj) as it might cause duplication of data owing to replication of data on several hosts. However, I guess the approach suggested by you which isn't host specific shall be a better way as it would eradicate all the duplicates for me. – Bhaskar Apr 10 '17 at 17:15
  • I think, data migration from one table to another is a rare operation and It's not worth big complications of code – Mikhail Baksheev Apr 10 '17 at 20:12
1

You could use Apache Spark Streaming.
Technically, you will read data from the first table, do on-the-fly transformation and write to the second table.
Note, I prefer Spark scala API, as it has more elegant API and streaming jobs code would be more laconic. But if you want to do it using pure Java, that's your choice.

S. Stas
  • 800
  • 4
  • 8