How to use multiple threads in Java to process a large number of files stored in a local disk directory (using file locks)?
-
I'd advise you not to. When processing a large number of files it's probably disk I/O which kills you, not the CPU. Multiple threads will only make that bottleneck worse. – Joey Sep 18 '09 at 05:24
-
@Johannes, while generally true, it does depend on the processing, disk buffering and even distribution of files across different physical media. It may be that the processing is incredibly complex and far outweighs the disk I/O time. – paxdiablo Sep 18 '09 at 05:40
-
Pax: That's why the "probably" is in there. But "large number of files" starts at around a few tens of thousands for me, and when they each take 3 minutes of processing then you probably have other worries (such as finding another computer to work on for the next few months). – Joey Sep 18 '09 at 05:42
-
You are absolutely correct, sir. But how can I increase the processing rate by distributing tasks across "n" created threads and get my file processing done faster? – Sep 18 '09 at 05:52
-
In my run() method I am taking one directory as input, consisting of some 10 files, and in the main program I am increasing the number of threads and measuring the time taken for n threads to complete the tasks... – Sep 18 '09 at 05:56
7 Answers
You don't want to read the files in parallel (disk I/O doesn't parallelize well). Better, then, to let a single thread read the files, send the contents off to worker threads for parallel processing, and then collect the results from the workers. Using the excellent ExecutorService and company from java.util.concurrent spares you the dirty details of threading and makes your solution far more flexible.
Here's a simple example. Assuming Foo is the result of processing a file:
// Assumed imports: java.io.*, java.util.*, java.util.concurrent.*
public List<Foo> processFiles(Iterable<File> files) throws Exception {
    List<Future<Foo>> futures = new ArrayList<Future<Foo>>();
    ExecutorService exec = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    for (File f : files) {
        final byte[] bytes = readAllBytes(f); // defined elsewhere
        futures.add(exec.submit(new Callable<Foo>() {
            public Foo call() throws Exception {
                InputStream in = new ByteArrayInputStream(bytes);
                // Read a Foo object from "in" and return it
                return readFoo(in); // defined elsewhere
            }
        }));
    }
    List<Foo> foos = new ArrayList<Foo>(futures.size()); // List is an interface; instantiate an ArrayList
    for (Future<Foo> f : futures) foos.add(f.get());     // get() blocks until each task completes
    exec.shutdown();
    return foos;
}
TODO: Add exception handling etc. You may also want to instantiate the ExecutorService outside of processFiles so you can reuse it between calls.
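As a minimal sketch of that reuse (the field and the close() method here are assumptions, not part of the original answer):
// Sketch: hold the pool in a field so it outlives individual processFiles calls.
private final ExecutorService exec =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

// processFiles(...) then submits to this.exec and omits the exec.shutdown() call.

public void close() {
    exec.shutdown(); // shut the pool down once, when no more batches are expected
}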

The best way I know of doing it (in any language, not just Java) is to use a producer/multi-consumer paradigm.
Have one thread create a queue and then start up N other threads. This main thread then enumerates all the files and places their names on that queue, followed by N end-of-queue markers.
The "other" threads simply read the next name off that queue and process the file. When they read an end-of-queue marker, they exit (and the main thread can reap their exit status if need be).
This reduces the communication between threads to the queue (which should, of course, be protected by a mutex so as not to cause race conditions among the threads). It also allows the threads to control their own exit condition (under direction from the main thread), another good way to avoid certain multi-threading problems.
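Here's a minimal Java sketch of this paradigm (the directory name, queue capacity, worker count, and END marker value are all arbitrary choices; a BlockingQueue supplies the mutual exclusion):
import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumers {
    private static final String END = "<<END-OF-QUEUE>>"; // end-of-queue marker

    public static void main(String[] args) throws InterruptedException {
        final int N = 4; // number of consumer threads
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100);

        Thread[] workers = new Thread[N];
        for (int i = 0; i < N; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        String name;
                        while (!(name = queue.take()).equals(END)) {
                            // process the file called `name` here
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers[i].start();
        }

        // Producer: the main thread enumerates the files and feeds the queue.
        for (File f : new File("some-dir").listFiles()) { // assumes some-dir exists and is a directory
            queue.put(f.getAbsolutePath());
        }
        for (int i = 0; i < N; i++) {
            queue.put(END); // one marker per consumer
        }

        for (Thread w : workers) {
            w.join(); // "reap" each worker
        }
    }
}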

Here's how I usually do it.
You can create a blocking Queue like this:
LinkedBlockingQueue<String> files = new LinkedBlockingQueue<String>(1000);
AtomicBoolean done = new AtomicBoolean(false);
String endMarker = "<<END-OF-QUEUE>>"; // sentinel: LinkedBlockingQueue rejects null elements
The queue can only hold 1000 elements, so if you somehow have a billion files, you don't have to worry about running out of memory. You can change the capacity to whatever you want, based on how much memory you are willing to use.
In your main thread you do something like:
File directory = new File("path/to/folder"); // note: "path\to\folder" would be a broken escape sequence; use / or \\
for (File file : directory.listFiles()) {
    files.put(file.getAbsolutePath());
}
files.put(endMarker); // this last entry tells the worker threads to stop; put(null) would throw a NullPointerException
The put method blocks until space becomes available in the queue, so if the queue fills up, the producer will stop reading files. Of course, because File.listFiles() actually returns an array, rather than a Collection that doesn't need to be loaded entirely into memory, you still end up loading the complete list of file names into memory if you use this function. If that ends up being a problem, you'll have to do something else.
But this model also works if you have some other method of listing files (for example, if they're all in a database). Just replace the call to directory.listFiles() with whatever you use to get your file list. Also, if you have to process files in subdirectories, you'll have to go through them recursively, which can be annoying (but this gets around the memory issue for extremely large directories); a recursive sketch follows.
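A minimal recursive walk that feeds the same queue (the method name enqueueAll is an assumption, used only for illustration):
// Assumes java.io.File and java.util.concurrent.BlockingQueue are imported.
private void enqueueAll(File dir, BlockingQueue<String> files) throws InterruptedException {
    File[] children = dir.listFiles();
    if (children == null) return; // not a directory, or an I/O error occurred
    for (File child : children) {
        if (child.isDirectory()) {
            enqueueAll(child, files); // recurse into subdirectories
        } else {
            files.put(child.getAbsolutePath()); // blocks whenever the queue is full
        }
    }
}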
Then, in your worker threads:
public void run() {
    try {
        while (!done.get()) {
            String filename = files.take();
            if (!filename.equals(endMarker)) {
                // do stuff with your file.
            } else {
                done.set(true);       // signal to the other threads that we found the final element
                files.put(endMarker); // put the marker back so workers still blocked in take() wake up too
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // take() and put() may be interrupted; exit cleanly
    }
}
If all the files in the queue have been processed, take() will wait until new elements show up.
That's the basic idea, anyway; this code is off the top of my head and hasn't been tested exactly as is.

-
`files.put(null);` will give you a null pointer exception as per the specs – VHS Aug 16 '17 at 02:47
With Java 8, you can easily achieve this using parallel streams. See the following code snippet:
// Assumed imports: java.io.IOException, java.nio.file.*, java.util.stream.Stream
try (Stream<Path> paths = Files.walk(Paths.get("some-path"))) { // try-with-resources closes the stream
    paths.parallel().forEach(file -> {/*do your processing*/});
} catch (IOException e1) {
    e1.printStackTrace();
}
With a parallel stream, the runtime will spawn the required number of threads, not exceeding the number of logical CPU cores, to process the collection elements (files, in our case) in parallel. You can also control the number of threads by passing the common pool's parallelism as a JVM argument, as shown below.
The advantage of this approach is that you don't have to really do any low level work of creating and maintaining threads. You just focus on your high level problem.
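For example, the common ForkJoinPool that parallel streams draw their workers from can be capped at four threads with a standard system property (the class name MyApp is just a placeholder):
java -Djava.util.concurrent.ForkJoinPool.common.parallelism=4 MyApp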

A lot of the leg work has been done for you in the Java Concurrency classes. You probably want something like ConcurrentLinkedQueue.
An unbounded thread-safe queue based on linked nodes. This queue orders elements FIFO (first-in-first-out). The head of the queue is that element that has been on the queue the longest time. The tail of the queue is that element that has been on the queue the shortest time. New elements are inserted at the tail of the queue, and the queue retrieval operations obtain elements at the head of the queue. A ConcurrentLinkedQueue is an appropriate choice when many threads will share access to a common collection.
You use the offer() method to put entries on the queue, either in the main thread or a separate thread. Then you have a bunch of worker bees (ideally created in something like ExecutorService) that use the poll() method to pull the next entry out of the queue and process it.
Using this design gives you incredible flexibility in determining how many producers and how many consumers run concurrently, without having to write any waiting/polling code yourself. You can create your pool of minions using Executors.newFixedThreadPool(); a sketch follows.
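A minimal sketch of the offer()/poll() pattern (the directory name and pool size are arbitrary; it also assumes the queue is fully populated before the workers start, since poll() returns null for a merely-empty queue and a live producer would need an end-of-work signal instead):
import java.io.File;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        final ConcurrentLinkedQueue<File> queue = new ConcurrentLinkedQueue<File>();
        for (File f : new File("some-dir").listFiles()) { // assumes some-dir exists
            queue.offer(f); // offer() never blocks on this unbounded queue
        }

        ExecutorService minions = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            minions.execute(new Runnable() {
                public void run() {
                    File f;
                    while ((f = queue.poll()) != null) { // null means the queue is empty
                        // process f here
                    }
                }
            });
        }
        minions.shutdown();
        minions.awaitTermination(1, TimeUnit.HOURS);
    }
}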

What you really want to do is have your main program traverse the directory, getting File references. Use each reference to create an object which implements Runnable; the run() method of the Runnable holds all of your processing logic. Create an ExecutorService and call execute(Runnable) to submit the tasks to the executor service. The Executor will run the tasks as threads become available, based on the type of Executor you create (Executors.newFixedThreadPool() is a good choice).
When your main thread has found all of the files and submitted them as tasks, you want to call shutdown() on the Executor and then call awaitTermination() (see http://java.sun.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html#awaitTermination(long, java.util.concurrent.TimeUnit)). Calling shutdown() tells the executor to finish executing the tasks it was given and then close; calling awaitTermination() causes your main thread to block until the Executor shuts down. That, of course, assumes you want to wait for all tasks to finish and then do more processing.
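Here's a minimal sketch of that flow (the directory name and the timeout are placeholder assumptions):
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DirectoryProcessor {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService exec = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // The main thread traverses the directory and submits one task per file.
        for (final File file : new File("some-dir").listFiles()) { // assumes some-dir exists
            exec.execute(new Runnable() {
                public void run() {
                    // all of your processing logic for `file` goes here
                }
            });
        }

        exec.shutdown();                          // accept no new tasks; finish the submitted ones
        exec.awaitTermination(1, TimeUnit.HOURS); // block until done (the timeout is arbitrary)
        // ...do any further processing here...
    }
}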

I am working on a similar problem where I have to process a few thousand text files. I have a file poller which polls the directory, prepares a list of the files found in it (including subdirectories), and calls a method, say fileFound, with the list as an argument.
In the fileFound method, I iterate over the list and submit a new task for each file, using an ExecutorService to control the number of active threads. The code goes like this:
private static final ExecutorService EXECUTOR = Executors.newFixedThreadPool(10);

public void fileFound(List<File> fileList) {
    for (File file : fileList) {
        FileProcessor fprocessor = new FileProcessor(file); // run() implements the business rules for the file
        EXECUTOR.submit(fprocessor);
    }
}
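For completeness, a hypothetical shape for FileProcessor (the original answer only describes it, so this structure is an assumption):
import java.io.File;

// A hypothetical FileProcessor; the answer only says its run() method holds the business rules.
class FileProcessor implements Runnable {
    private final File file;

    FileProcessor(File file) {
        this.file = file;
    }

    public void run() {
        // implement the business rules for `file` here
    }
}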
My observations:
- Processing 3.5K files (~32 GB total) one by one, without multi-threading, took ~9 hours.
- With the number of threads fixed at 5: 118 minutes.
- With the number of threads fixed at 10: 75 minutes.
- With the number of threads fixed at 15: 72 minutes.

-
Can you please share the number of CPU cores on which 10 threads turned out to be optimal? – sunny_dev Apr 08 '17 at 10:52