
In one of my use cases I need to fetch data from multiple nodes. Each node maintains a range (partition) of the data, and the goal is to read it as fast as possible. The constraint is that the cardinality of a partition is not known beforehand. With a work-sharing approach, I could split the partitions into sub-partitions and fetch them in parallel. The drawback is that one thread could end up fetching a lot of data and take much longer while another thread finishes early. The other approach is work stealing: break the partitions into much smaller ranges and use ForkJoinPool. The drawback there is that, if a partition is sparse, we could make many round trips to the server only to find out that there is no data for a sub-partition.
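
The work-stealing variant described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RangeFetch`, `FetchTask`, `fetch`, `fetchAll`, the pool size of 4 and the `threshold` parameter are all hypothetical names and values, and an in-memory `TreeMap` stands in for a node, where a real `fetch` would be a network call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class RangeFetch {
    // Hypothetical stand-in for a remote read of the half-open key range [lo, hi).
    // A real implementation would make a (possibly blocking) network call here.
    static List<String> fetch(NavigableMap<Integer, String> node, int lo, int hi) {
        return new ArrayList<>(node.subMap(lo, hi).values());
    }

    // Splits [lo, hi) in half until a range is at most `threshold` keys wide.
    static final class FetchTask extends RecursiveTask<List<String>> {
        private final NavigableMap<Integer, String> node;
        private final int lo, hi, threshold;

        FetchTask(NavigableMap<Integer, String> node, int lo, int hi, int threshold) {
            this.node = node;
            this.lo = lo;
            this.hi = hi;
            this.threshold = threshold;
        }

        @Override
        protected List<String> compute() {
            if (hi - lo <= threshold) {
                return fetch(node, lo, hi);       // small enough: one round trip
            }
            int mid = lo + (hi - lo) / 2;
            FetchTask left = new FetchTask(node, lo, mid, threshold);
            FetchTask right = new FetchTask(node, mid, hi, threshold);
            left.fork();                          // left half becomes stealable
            List<String> rightRows = right.compute();
            List<String> rows = new ArrayList<>(left.join());
            rows.addAll(rightRows);               // keep key order: left, then right
            return rows;
        }
    }

    static List<String> fetchAll(NavigableMap<Integer, String> node,
                                 int lo, int hi, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        ForkJoinPool pool = new ForkJoinPool(4);  // pool size is an arbitrary choice
        try {
            return pool.invoke(new FetchTask(node, lo, hi, threshold));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> node = new TreeMap<>();
        node.put(1500, "a");
        node.put(1501, "b");
        node.put(90000, "c");
        System.out.println(fetchAll(node, 1, 100001, 1000));
    }
}
```

With a range of 100000 keys and a threshold of 1000, this sketch issues on the order of a hundred leaf fetches even though only three keys exist, which is exactly the sparse-partition drawback mentioned above.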

My question is: if I want to use ForkJoinPool and the tasks have to do some I/O, how do I do that? From the FJ pool documentation and the best practices I have read so far, it appears that the FJ pool is not a good fit for blocking I/O. If I want to use non-blocking I/O instead, how can I do that?
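
For reference, the JDK does offer one hook for blocking calls inside a ForkJoinPool: `ForkJoinPool.ManagedBlocker`, used through `ForkJoinPool.managedBlock`. While a worker is blocked inside it, the pool may activate a spare thread to keep up parallelism. The sketch below shows that mechanism only; it does not make the I/O non-blocking. The names `BlockingFetch`, `Blocker` and `callBlocking` are hypothetical, and the `Supplier` stands in for whatever blocking fetch would really be made.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Supplier;

class BlockingFetch {
    // Wraps one blocking call so the pool knows the worker is about to block.
    static final class Blocker<T> implements ForkJoinPool.ManagedBlocker {
        private final Supplier<T> call;   // hypothetical blocking fetch
        private T result;
        private boolean done;

        Blocker(Supplier<T> call) {
            this.call = call;
        }

        @Override
        public boolean block() {
            if (!done) {
                result = call.get();      // the blocking work happens here
                done = true;
            }
            return true;                  // true: no further blocking is needed
        }

        @Override
        public boolean isReleasable() {
            return done;                  // lets the pool skip block() once finished
        }

        T result() {
            return result;
        }
    }

    // Runs `call` through managedBlock; safe to use from inside an FJ task.
    static <T> T callBlocking(Supplier<T> call) {
        Blocker<T> blocker = new Blocker<>(call);
        try {
            ForkJoinPool.managedBlock(blocker);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        }
        return blocker.result();
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(2);
        Callable<String> task = () -> callBlocking(() -> "rows from a blocking fetch");
        System.out.println(pool.submit(task).join());
        pool.shutdown();
    }
}
```

A leaf task would call `callBlocking(...)` around its fetch instead of calling the blocking client directly. A truly non-blocking design would instead need a client that returns something like a `CompletableFuture` and would compose those futures rather than occupy pool workers.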

Florian Fankhauser
chandra_cst
  • Where are your partitions? Are they badly balanced? Is it fixable? Is a partition a single file or multiple files? – Dici Sep 21 '18 at 20:02
  • Across the nodes it is balanced well (mostly), but the data within a range may not be. For instance, if one node manages the data between 1 and 100000, it is possible that the partition (1-1000) is empty. – chandra_cst Sep 21 '18 at 20:17
  • Can you know the size of the data cheaply? If not, can you build a cache of the size of the data? – Dici Sep 21 '18 at 20:27
  • The data size is not known. I'm exploring other approaches to get size estimates. But I'm more concerned about the FJ pool. Is it the right fit here? If so, how do I deal with non-blocking IO when using the FJ pool? If not, why? – chandra_cst Sep 21 '18 at 20:51
  • Why do you think that multiple requests to the same node are faster than one? Make one query, fetch data and split it across threads for processing. – Alexander Pavlov Sep 22 '18 at 03:20
  • "the goal is to read the data as fast as possible" is.. quite vague :-) On F/J pools, you may want to read http://coopsoft.com/ar/CalamityArticle.html – giorgiga Oct 09 '18 at 21:27
  • @giorgiga yeah, I read that before and it points out FJ pools are not good for blocking IO. – chandra_cst Oct 14 '18 at 03:37

0 Answers