
I have a billion records which are unsorted, unrelated to each other and I have to call a function processRecord on each record using Java.

The easy way to do so is using a for loop but it is taking a lot of time.

The other way I could think of is using multithreading but the question is how to divide the dataset of records efficiently and among how many threads?

Is there an efficient way to process this large dataset?

OneCricketeer
Yug Singh
  • Number of threads is up to you... Designate a specific thread to take lines 1-10000, and another for 10001-20000. Easy answer is to start with two threads. Split the file in half... Continue to divide and conquer... – OneCricketeer Apr 08 '18 at 17:47
  • Alternatively, you could attempt to use something like Apache Spark, which is the "proper solution" for BigData type problems. – OneCricketeer Apr 08 '18 at 17:52
  • @cricket_007 I just read about the batch processing of data using Hadoop MapReduce. Do you think using Hadoop MapReduce approach is efficient? – Yug Singh Apr 08 '18 at 18:06
  • Depends what you want to accomplish. MapReduce is more complicated and limiting than Spark. And Spark doesn't require Hadoop, so I personally would recommend starting with it. – OneCricketeer Apr 08 '18 at 18:07

2 Answers


Measure

Before deciding which implementation path to choose, measure how long it takes to process a single item. Based on that you can choose the size of the work chunk submitted to a thread pool, queue, or cluster. Very small work chunks increase coordination overhead; too big a work chunk takes a long time to process, so you get less gradual progress information.
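A quick way to take that measurement is to time a sample of records before picking a chunk size. The record type and processRecord body below are placeholders standing in for the question's own code:

```java
import java.util.concurrent.ThreadLocalRandom;

public class MeasureSample {
    // Placeholder standing in for the question's processRecord
    static void processRecord(long record) {
        Math.sqrt(record); // simulate a little CPU work
    }

    // Time a sample of records and return the average cost per record
    static long nanosPerRecord(int sampleSize) {
        long start = System.nanoTime();
        for (int i = 0; i < sampleSize; i++) {
            processRecord(ThreadLocalRandom.current().nextLong());
        }
        return (System.nanoTime() - start) / sampleSize;
    }

    public static void main(String[] args) {
        System.out.println("~" + nanosPerRecord(100_000) + " ns per record");
    }
}
```

If one record takes microseconds, chunks of tens of thousands of records keep coordination overhead negligible.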

Single-machine processing is simpler to implement, troubleshoot, maintain and reason about.

Processing on single machine

Use java.util.concurrent.ExecutorService. Submit each work piece using the submit(Callable<T> task) method: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#submit-java.util.concurrent.Callable-

Create an instance of ExecutorService using java.util.concurrent.Executors.newFixedThreadPool(int nThreads). Choose a reasonable value for nThreads; the number of CPU cores is a reasonable starting value. You may use more threads if there are blocking IO calls (database, HTTP) in the processing.
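A minimal sketch of that setup, splitting a list of records into chunks across a fixed pool (processRecord and the chunk size are stand-ins for the question's own function and a measured value):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkedProcessing {
    static final AtomicLong processed = new AtomicLong();

    // Stand-in for the question's processRecord
    static void processRecord(int record) {
        processed.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) records.add(i);

        int nThreads = Runtime.getRuntime().availableProcessors();
        int chunkSize = 10_000; // tune based on the measured per-record cost
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<?>> futures = new ArrayList<>();

        // Submit one task per chunk instead of one task per record
        for (int start = 0; start < records.size(); start += chunkSize) {
            List<Integer> chunk =
                records.subList(start, Math.min(start + chunkSize, records.size()));
            futures.add(pool.submit(() -> chunk.forEach(ChunkedProcessing::processRecord)));
        }
        for (Future<?> f : futures) f.get(); // wait and propagate any exception
        pool.shutdown();
        System.out.println("processed " + processed.get());
    }
}
```

Submitting chunks rather than individual records keeps the task queue small and amortizes the per-task overhead.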

Processing on multiple machines - cluster

Submit processing jobs to cluster processing technologies like Spark, Hadoop, or Google BigQuery.

Processing on multiple machines - queue

You can submit your records to any queue system (Kafka, RabbitMQ, ActiveMQ, etc.). Then have multiple machines consume items from the queue. You can add or remove consumers at any time. This approach is fine if you do not need the processing results collected in a single place.
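The consumer pattern can be sketched inside a single JVM with a BlockingQueue standing in for Kafka/RabbitMQ; each consumer thread plays the role of a consumer machine, and the poison-pill sentinel is one common way to signal shutdown:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QueueConsumers {
    static final AtomicLong processed = new AtomicLong();

    // Stand-in for the question's processRecord
    static void processRecord(long record) {
        processed.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Long> queue = new LinkedBlockingQueue<>(10_000);
        int nConsumers = 4;
        long poison = Long.MIN_VALUE; // sentinel telling a consumer to stop
        ExecutorService consumers = Executors.newFixedThreadPool(nConsumers);

        for (int i = 0; i < nConsumers; i++) {
            consumers.submit(() -> {
                try {
                    while (true) {
                        long record = queue.take(); // blocks until an item arrives
                        if (record == poison) break;
                        processRecord(record);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (long r = 0; r < 100_000; r++) queue.put(r);       // producer side
        for (int i = 0; i < nConsumers; i++) queue.put(poison); // stop each consumer
        consumers.shutdown();
        consumers.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("processed " + processed.get());
    }
}
```

With a real broker the same loop shape applies, but consumers can live on separate machines and be scaled independently of the producer.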

Bartosz Bilicki
  • The most important keyword to me is "Measure". Never do any "optimization" without profiling the initial performance, finding the bottleneck / most time-consuming part. – Ralf Kleberhoff Apr 09 '18 at 08:12

A parallel stream could be used here to process your data in parallel. By default, parallel streams use the common ForkJoinPool, which has one thread fewer than the number of available processors.
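A minimal parallel-stream sketch (processRecord is a placeholder for the question's function):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;

public class ParallelStreamExample {
    static final AtomicLong processed = new AtomicLong();

    // Stand-in for the question's processRecord
    static void processRecord(long record) {
        processed.incrementAndGet();
    }

    public static void main(String[] args) {
        // parallel() splits the range across the common ForkJoinPool
        LongStream.range(0, 1_000_000)
                  .parallel()
                  .forEach(ParallelStreamExample::processRecord);
        System.out.println("processed " + processed.get());
    }
}
```

This suits CPU-bound processRecord calls; for blocking IO work, a dedicated ExecutorService with more threads is usually a better fit than the shared common pool.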

More detailed and useful information about that can be found here: https://stackoverflow.com/a/21172732/8184084