Cassandra, Java and MANY async requests: is this good?

Question

I'm developing a Java application that uses Cassandra. My table looks like this:

id  | registration | name 
 1          1         xxx
 1          2         xxx
 1          3         xxx
 2          1         xxx
 2          2         xxx
...        ...        ...
...        ...        ...
100,000    34        xxx

My tables have a very large number of rows (more than 50,000,000). I have a list, myListIds, of String ids to iterate over. I could use:

SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
// imagine more than 10,000,000 numbers in the 'IN' clause

But this is a bad pattern. So instead I'm using async requests this way:

    // mapFutures: key = id, value = future for the data from Cassandra
    Map<String, ResultSetFuture> mapFutures = new HashMap<>();

    for (String id : myListIds)
    {
        // one async request per id; nothing limits how many are in flight
        ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
        mapFutures.put(id, resultSetFuture);
    }

Then I will process my data with the getUninterruptibly() method.

Here is my problem: I may be sending more than 10,000,000 Cassandra requests (one request per 'id'), and I'm putting all of the results in a Map.

Can this cause a heap memory error? What's the best way to deal with that?

Thank you

Sounds like you are creating a lot of overhead. I am wondering if it wouldn't make more sense to process, say, 1K IDs "together" somehow. Doing millions of requests will definitely create a **lot** of overhead. — GhostCat, Dec 20 '18 at 15:06
You should make clearer what it is you need in the end. Do you require the `Map` you create or is it just an intermediate object you'll throw away later? Do you need to process all of the data together, or can you do it in batches, or even one each? — daniu, Dec 20 '18 at 15:21

score 5 · Answer 1 · answered Dec 20 '18 at 15:22

Note: your question is "is this a good design pattern".

If you have to perform 10,000,000 Cassandra data requests, then you have structured your data incorrectly. Ultimately, you should design your database from the ground up so that you only ever have to perform 1-2 fetches.

Now, granted, if you have 5000 Cassandra nodes this might not be a huge problem (it probably still is), but it still reeks of bad database design. I think the solution is to take a look at your schema.

score 0 · Answer 2 · answered Jan 11 '19 at 16:27

I see the following problems with your code:

An overloaded Cassandra cluster: it won't be able to process that many async requests, and your requests will fail with NoHostAvailableException.
An overloaded Cassandra driver: your client app will fail with IO exceptions, because the system will not be able to process that many async requests. (See the details about connection tuning: https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/)
And yes, memory issues are possible. It depends on the data size.

A possible solution is to limit the number of in-flight async requests and process the data in chunks. (E.g. see this answer )
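
One way to limit the number of in-flight requests is a counting semaphore. The following is only a minimal sketch under stated assumptions, not driver code: `ThrottledFetcher`, `forEachThrottled` and `maxInFlight` are hypothetical names, and a plain `CompletableFuture` stands in for the driver's `ResultSetFuture` so the example runs without a cluster. With the 3.x driver, the same idea would presumably mean releasing the permit from a listener attached to the future returned by `session.executeAsync(...)`. Each result is handed to a callback as it arrives instead of being collected in one big `Map`, which is what keeps heap usage bounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;
import java.util.function.Function;

final class ThrottledFetcher {

    /**
     * Issues one async query per id, but never keeps more than
     * maxInFlight queries outstanding at the same time.
     * asyncQuery is a stand-in (assumption) for a call such as
     * session.executeAsync(statement.bind(id)).
     */
    static <R> void forEachThrottled(
            List<String> ids,
            int maxInFlight,
            Function<String, CompletableFuture<R>> asyncQuery,
            BiConsumer<String, R> onResult) {

        Semaphore permits = new Semaphore(maxInFlight);

        for (String id : ids) {
            // Blocks the submitting thread while maxInFlight queries are pending.
            permits.acquireUninterruptibly();

            CompletableFuture<R> future;
            try {
                future = asyncQuery.apply(id);
            } catch (RuntimeException e) {
                permits.release(); // the query was never started
                throw e;
            }

            future.whenComplete((result, error) -> {
                try {
                    if (error == null) {
                        onResult.accept(id, result);
                    }
                    // Failed queries are skipped in this sketch;
                    // a real application would log or retry them.
                } finally {
                    permits.release(); // free the slot on success or failure
                }
            });
        }

        // Taking every permit succeeds only once all pending queries are done.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add(Integer.toString(i));
        }

        AtomicLong processed = new AtomicLong();

        // Fake query: completes on a background thread with a dummy row.
        ThrottledFetcher.<String>forEachThrottled(
                ids,
                32,
                id -> CompletableFuture.supplyAsync(() -> "row-for-" + id),
                (id, row) -> processed.incrementAndGet());

        System.out.println("processed " + processed.get() + " ids");
    }
}
```

The value of `maxInFlight` would have to be tuned against the pool settings described in the pooling documentation linked above; 32 here is an arbitrary number for the demo.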
