32

I have a large file that takes multiple hours to process, so I am thinking of estimating chunk boundaries and reading the chunks in parallel.

Is it possible to do concurrent reads on a single file? I have looked at both RandomAccessFile and nio.FileChannel, but based on other posts I am not sure whether this approach would work.
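
To illustrate what I mean, here is a minimal sketch of reading just one chunk via RandomAccessFile/FileChannel and its positional read (the file name, chunk size and which chunk to read are arbitrary placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkRead {
    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("big-input.txt", "r");
        FileChannel channel = raf.getChannel();
        try {
            long chunkSize = 64L * 1024 * 1024;   // arbitrary 64 MB chunks
            long start = chunkSize;               // say, the second chunk
            long end = Math.min(channel.size(), start + chunkSize);

            ByteBuffer buffer = ByteBuffer.allocate(8192);
            long position = start;
            while (position < end) {
                buffer.clear();
                buffer.limit((int) Math.min(buffer.capacity(), end - position));
                // Positional read: does not move the channel's position,
                // so several threads could share one channel this way.
                int read = channel.read(buffer, position);
                if (read == -1)
                    break;
                buffer.flip();
                // ... scan the buffer for line breaks / records here ...
                position += read;
            }
        } finally {
            channel.close();
            raf.close();
        }
    }
}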

user1132593
  • Which OS? Java or not, Windows does not handle this kind of thing well. – SJuan76 Aug 08 '12 at 15:00
  • I read somewhere that when it is disk IO, you may not get an advantage from concurrency. – kosa Aug 08 '12 at 15:01
  • Why the downvote? I found this question very interesting. – hectorg87 Aug 08 '12 at 15:03
  • @user1132593 I started looking for an answer to your question and found something really interesting; however, I don't know how far you want to go. Here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.658&rep=rep1&type=pdf – hectorg87 Aug 08 '12 at 15:20
  • Just to make sure the discussion goes in the right direction: is your problem that reading the data is too slow, or is it the processing of the data that is too slow? – Buhb Aug 08 '12 at 15:33
  • Since my application is in Java, I was assuming platform independence? The problem is with reading (I have to read line by line and identify a set per some criteria), after which I already use java.util.concurrent to do the processing in a different thread. – user1132593 Aug 08 '12 at 15:54
  • Just to get an idea of what speeds you're currently reaching: how big is a typical file that takes several hours to process? – Buhb Aug 08 '12 at 15:59
  • Buhb - the problem statement is that reading a large file is slow, so is there a way to chunk and parallelize the reading of a single file, i.e. multiple threads reading different chunks of the same file? – user1132593 Aug 08 '12 at 16:20
  • I just want to rule out the possibility that the current way you're reading files is suspiciously slow. Reading a file several GB in size should take significantly less than hours. – Buhb Aug 08 '12 at 16:56

4 Answers

23

The most important question here is what is the bottleneck in your case.

If the bottleneck is your disk IO, then there isn't much you can do in software. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.

If the bottleneck is processing power, and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel, like in this example:

import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class Split {
    private File file;

    public Split(File file) {
        this.file = file;
    }

    // Processes the given portion of the file.
    // Called simultaneously from several threads.
    // Use your custom return type as needed, I used String just to give an example.
    public String processPart(long start, long end)
        throws Exception
    {
        InputStream is = new FileInputStream(file);
        try {
            // InputStream.skip() may skip fewer bytes than requested,
            // so loop until we are really positioned at 'start'
            long skipped = 0;
            while (skipped < start)
                skipped += is.skip(start - skipped);

            // do a computation using the input stream,
            // checking that we don't read more than (end-start) bytes
            System.out.println("Computing the part from " + start + " to " + end);
            Thread.sleep(1000);
            System.out.println("Finished the part from " + start + " to " + end);
            return "Some result";
        } finally {
            is.close();
        }
    }

    // Creates a task that will process the given portion of the file,
    // when executed.
    public Callable<String> processPartTask(final long start, final long end) {
        return new Callable<String>() {
            public String call()
                throws Exception
            {
                return processPart(start, end);
            }
        };
    }

    // Splits the computation into chunks of the given size,
    // creates appropriate tasks and runs them using a 
    // given number of threads.
    public void processAll(int noOfThreads, int chunkSize)
        throws Exception
    {
        int count = (int)((file.length() + chunkSize - 1) / chunkSize);
        java.util.List<Callable<String>> tasks = new ArrayList<Callable<String>>(count);
        for(int i = 0; i < count; i++)
            tasks.add(processPartTask(i * chunkSize, Math.min(file.length(), (i+1) * chunkSize)));
        ExecutorService es = Executors.newFixedThreadPool(noOfThreads);

        java.util.List<Future<String>> results = es.invokeAll(tasks);
        es.shutdown();

        // use the results for something
        for(Future<String> result : results)
            System.out.println(result.get());
    }

    public static void main(String argv[])
        throws Exception
    {
        Split s = new Split(new File(argv[0]));
        s.processAll(8, 1000);
    }
}
Petr
  • Thanks Petr, I have something similar but was using Runnables (the old way). My observation was that only one thread was busy and that's why I posted this question. I will retry soon and post back my observations. – user1132593 Aug 08 '12 at 18:03
  • I was able to chunk the file and read it concurrently. For a 0.5 GB text file here were my results (hh.mm.ss.SSS): chunks=[1]: 0:18:10.328, chunks=[2]: 0:13:19.125, chunks=[3]: 0:12:54.824. Not much of a difference. However, for me the best solution was to zip the file and serially process the zip file, because of the high compression ratio: the zip file ended up being 10 MB. – user1132593 Aug 09 '12 at 19:40
  • I guess your answer is quite performant. If reading a big file creates a lot of data in memory, then your answer would also be a good solution. – xpioneer Mar 15 '20 at 11:49
9

You can parallelise reading a large file provided you have multiple independent spindles. E.g. if you have a RAID 0+1 striped file system, you can see a performance improvement by triggering multiple concurrent reads of the same file.

If, however, you have a combined file system like RAID 5 or 6, or a plain single disk, it is highly likely that reading the file sequentially is the fastest way to read from that disk. Note: the OS is smart enough to prefetch reads when it sees you are reading sequentially, so using an additional thread to do this is unlikely to help.

I.e. using multiple threads will not make your disk any faster.

If you want to read from disk faster, use a faster drive. A typical SATA HDD can read about 60 MB/s and perform 120 IOPS. A typical SATA SSD can read at about 400 MB/s and perform 80,000 IOPS, and a typical PCI SSD can read at 900 MB/s and perform 230,000 IOPS.

Peter Lawrey
  • Peter, the issue is with a single file on my hard disk. For RAID etc., how would I split the file? – user1132593 Aug 08 '12 at 16:02
  • RAID will split the file automatically if you use striping (or, with RAID 1, it will copy it to two disks). RAID 5 and 6 can get the benefit of striping, but this depends on your controller, as these are often optimised for maximum throughput, i.e. sequential reads. – Peter Lawrey Aug 08 '12 at 17:03
  • I am executing a test for the same use case - reading a single file from multiple threads. I found that having multiple threads improves the performance if the underlying storage is a SATA disk drive, while it improves the performance if it's a SAS drive. Would it be because of the point-to-point technology used in SAS, or is my test doing something incorrect? – Andy Dufresne Jun 24 '13 at 13:23
  • @AndyDufresne This is possible, esp. if you are doing more than just reading, e.g. you are processing the data as well. Just copying the data into user space is work. Try using direct ByteBuffers to reduce overhead and you might find less benefit. – Peter Lawrey Jun 24 '13 at 18:17
  • @PeterLawrey - I am sorry, I realized that my comment was slightly incorrect. Multi-threading improved the performance when a SAS drive was used, while it degraded the performance when a SATA drive was used. Also, the test involved multiple threads reading different sections of the same file. I didn't completely understand your comment, but possibly your interpretation was based on my earlier comment, which wasn't completely correct. Let me know if you have any thoughts. – Andy Dufresne Jun 25 '13 at 05:57
  • @AndyDufresne I believe SCSI and SAS drives have extra controller support for concurrent requests. This can improve the read behaviour, esp. if the file is fragmented. Of course an SSD drive will be much, much faster in any case, esp. a PCI SSD. – Peter Lawrey Jun 25 '13 at 06:33
2

If you're reading a file from a hard drive, then the fastest way to get the data is to read the file from start to end, that is, not concurrently.

Now if it's the processing that takes time, then that might benefit from having several threads processing different chunks of data concurrently, but that has nothing to do with how you're reading the file.
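
A rough sketch of that division of labour, assuming the single-reader/many-workers approach also described in the comments below (the class name, batch size and queue capacity are invented for illustration, not a tuned implementation):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadOneProcessMany {
    // sentinel object that tells a worker there is no more input
    private static final List<String> POISON = new ArrayList<String>();

    public static void main(String[] args) throws Exception {
        final BlockingQueue<List<String>> queue =
            new ArrayBlockingQueue<List<String>>(16);   // bounded, so the reader can't run ahead forever
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // worker threads: take a batch of lines off the queue and process it
        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            List<String> batch = queue.take();
                            if (batch == POISON) {
                                queue.put(POISON);      // pass the sentinel on to the next worker
                                break;
                            }
                            process(batch);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // single reader: sequential read, which is what the disk handles best
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            List<String> batch = new ArrayList<String>();
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                if (batch.size() == 10000) {            // batch size is arbitrary
                    queue.put(batch);
                    batch = new ArrayList<String>();
                }
            }
            if (!batch.isEmpty())
                queue.put(batch);
            queue.put(POISON);
        } finally {
            in.close();
        }
        pool.shutdown();
    }

    private static void process(List<String> lines) {
        // apply the per-line criteria here
    }
}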

Buhb
  • I think this doesn't answer the question. The question is: is it possible to "parallelize" the reading of a large file? – hectorg87 Aug 08 '12 at 15:04
  • I was under the impression that the fundamental question was more along the lines of "can I read a file quicker by parallelizing the read?" – Buhb Aug 08 '12 at 15:05
  • After your edit: I guess it has to do with the reading, because it's a "large file", as he stated. BTW, the -1 is not from me. – hectorg87 Aug 08 '12 at 15:06
  • You can read different parts of the file concurrently with ease by memory mapping different sections (hint: `FileChannel.map`). – obataku Aug 08 '12 at 15:08
  • As long as a reasonable number of "chunks" can fit in memory, you can skip the IO problem entirely by having a reader thread read chunks serially from the file, adding them to a queue, and having several worker threads process chunks from the queue. – Buhb Aug 08 '12 at 15:08
  • Wow, I don't really mind the downvotes, but if I've misunderstood the problem that user1132593 hopes to solve by concurrently reading from different parts of a file in a way that makes my answer a bad one, I'd be really happy if someone tried to explain what I'm missing. – Buhb Aug 08 '12 at 15:15
  • @Buhb That's what the OS does for you already so doing it again might make your program slower. BTW +1 – Peter Lawrey Aug 08 '12 at 15:28
  • @veer FileChannel.map speeds up loading the disk cache into the memory of the application. It can make a huge difference if the data is already loaded. If the data is only on disk, it might not help or can make it slower. – Peter Lawrey Aug 08 '12 at 15:30
1

You can process in parallel; however, your hard drive can only read one piece of data at a time. If you read in the file with a single thread, you can then process the data with several threads.

Brad