
I want to read all lines of a 1 GB file as fast as possible into a Stream<String>. Currently I'm using Files.lines(path) for that. After parsing the file, I'm doing some computations (map()/filter()).

At first I thought this was already done in parallel, but it seems I was wrong: reading the file as a whole takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the parts in parallel, it only takes about 30 seconds.

I tried the following combinations:

  1. single file, no parallel lines() stream ~ 50 seconds
  2. single file, Files.lines(..).parallel().[...] ~ 50 seconds
  3. two files, no parallel lines() stream ~ 30 seconds
  4. two files, Files.lines(..).parallel().[...] ~ 30 seconds

I ran these four cases multiple times with roughly the same results (varying by 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
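
Combination 2 corresponds roughly to the following sketch; the path and the trim/filter steps are placeholders for the actual [...] chain, which is not shown here:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class SingleFileParallel {
        public static void main(String[] args) throws IOException {
            Path path = Paths.get("file");            // placeholder path to the 1 GB input file

            try (Stream<String> lines = Files.lines(path)) {
                String[] result = lines
                        .parallel()                   // combination 2: parallel stream over one file
                        .map(String::trim)            // stand-in for the real map() steps
                        .filter(s -> !s.isEmpty())    // stand-in for the real filter() steps
                        .toArray(String[]::new);      // the terminal toArray(...) triggers evaluation
                System.out.println(result.length);
            }
        }
    }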

The conclusion is that using lines().parallel() makes no difference. Since reading two files in parallel takes less time, there is a performance gain from splitting the file; however, it seems a single file is still read serially.
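
One way the two-file variant (combinations 3 and 4) could be driven from Java is sketched below; the file names xaa/xab (split's default output names) and the map/filter steps are placeholders, not the exact setup used for the timings above:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Stream;

    public class TwoFileVariant {
        public static void main(String[] args) {
            // Process both halves concurrently; join() waits for the results.
            CompletableFuture<String[]> first  = CompletableFuture.supplyAsync(() -> process("xaa"));
            CompletableFuture<String[]> second = CompletableFuture.supplyAsync(() -> process("xab"));
            System.out.println(first.join().length + second.join().length);
        }

        static String[] process(String file) {
            try (Stream<String> lines = Files.lines(Paths.get(file))) {
                return lines
                        .map(String::trim)            // stand-in for the real map()/filter() chain
                        .filter(s -> !s.isEmpty())
                        .toArray(String[]::new);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }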

Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

   time split -l 829326 file # 829326 = 1658652 / 2
   split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total

So my question is: is there any class or function in the Java 8 JDK that can parallelize reading all lines without having to split the file first? For example, if I have two CPU cores, the first line reader should start at the first line and the second one at line (totalLines / 2) + 1.

    What do you mean _split the file using bash_? Also, you have to read the file to find the lines. It can't just know where the line terminators are. – Sotirios Delimanolis Sep 07 '14 at 15:07
    @SotiriosDelimanolis You can jump to an arbitrary point in the file and look for the first new line and split that way without processing all the data in the entire segment first. – Michael Petch Sep 07 '14 at 15:22
  • In bash, you can use `split -l #lines`; combine that with `wc -l` to split it into two halves. Michael Petch understood my idea :) – user3001 Sep 07 '14 at 15:23
  • @michael Sure, but then you have to reconstruct everything leading up to that end of line. – Sotirios Delimanolis Sep 07 '14 at 15:24
  • You can jump (seek) to the midpoint of the file, read/ignore the fractional line (let the first thread read the line that spans the midpoint). That isn't the problem. You can generalize this by dividing the file into N parts easily. ... – Ira Baxter Sep 07 '14 at 15:26
  • ... Probably the real difficulty is that the file is huge and can't fit in cache, so it has to be read from disk. But the disk only has one seek head (typically), so you can pay the price of seeking between the halves (very expensive) or read sequentially from one end. The price likely gets worse as N goes up because of multiple demands on the heads. So as N goes up, you pay seek time costs that will likely impact your parallelism. My guess is you simply have to try this for various N; there are too many buffers/overheads to easily guess where the sweet spot is. Expect tail-off for small N. – Ira Baxter Sep 07 '14 at 15:28
    @IraBaxter although I agree, it is interesting that the split operation can be done from bash more efficiently. I assumed user3001 also included the time to split with the time to process. – Michael Petch Sep 07 '14 at 15:31
  • @MichaelPetch: Bash can't do anything Java can't do. Both eventually call the OS, and are dependent on what it does, and its bottlenecks. So this means the details have not been investigated enough. – Ira Baxter Sep 07 '14 at 15:36
  • I edited the question to clarify that there is practically no seeking time - compared to a hard disk. Even if I add the 2 seconds for splitting the file, there is still > 18 seconds difference. – user3001 Sep 07 '14 at 16:09
  • The problem with using parallel streams for this task is that the underlying code does not know your intent. It will therefore likely attempt to read the same bytes of the file in parallel, thereby duplicating the work and not saving any time. The better approach is to parallelize it yourself (e.g., using an ExecutorService), with each thread working on its *own* segment of data. When I did this I was able to process a 5 GB file in around 4 seconds. However, because I read the entire file into memory, it consumed too much memory and I abandoned it in favor of using a BufferedReader. – Aquarelle Feb 15 '23 at 22:08
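
A minimal sketch of the segmented approach described in the comments above (seek near a boundary, skip the partial line, let each worker handle its own byte range). The line counting is only a placeholder for the real map()/filter() work, and RandomAccessFile.readLine() assumes a single-byte-compatible encoding:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SegmentedLineReader {

        public static void main(String[] args) throws Exception {
            String file = "file";                     // placeholder input path
            int threads = Runtime.getRuntime().availableProcessors();

            long size;
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                size = raf.length();
            }

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Long>> results = new ArrayList<>();
            long chunk = size / threads;
            for (int i = 0; i < threads; i++) {
                long start = i * chunk;
                long end = (i == threads - 1) ? size : (i + 1) * chunk;
                results.add(pool.submit(() -> countNonEmptyLines(file, start, end)));
            }

            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            pool.shutdown();
            System.out.println("non-empty lines: " + total);
        }

        // Processes every line whose first byte lies in [start, end); a line that
        // crosses a segment boundary is handled entirely by the segment it starts in.
        static long countNonEmptyLines(String file, long start, long end) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                if (start == 0) {
                    raf.seek(0);
                } else {
                    raf.seek(start - 1);
                    raf.readLine();                   // skip to the first line starting at or after 'start'
                }
                long count = 0;
                String line;
                while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                    if (!line.isEmpty()) {            // placeholder for the real map()/filter() work
                        count++;
                    }
                }
                return count;
            }
        }
    }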

1 Answer


You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

If you map the file into memory via a file channel, you should be able to process the data from there in parallel at great speed, but chances are you won't even need to, as you'll already see a huge speed increase.
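
As a rough illustration of the memory-mapping idea (not necessarily what the linked post does), the sketch below maps the file read-only and then streams its lines; UTF-8 and a single whole-file mapping (which is limited to 2 GB) are assumptions here:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.regex.Pattern;

    public class MappedFileExample {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("file"), StandardOpenOption.READ)) {
                // Map the whole file into memory; a single mapping is limited to 2 GB.
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

                // Decode once (this holds the whole file on the heap), then hand the
                // lines to a (possibly parallel) stream for the map()/filter() steps.
                String content = StandardCharsets.UTF_8.decode(buffer).toString();
                long nonEmpty = Pattern.compile("\\R").splitAsStream(content)
                        .parallel()
                        .filter(s -> !s.isEmpty())
                        .count();
                System.out.println(nonEmpty);
            }
        }
    }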

  • I changed the method I used for testing; it seems I was wrong, as I had accidentally limited the result size to 200 before doing the final operation (forEach(System.out::println)) on the resulting stream. Actually, Files.lines(Path) reads the given file as a sequential stream. Moreover, most of the time was actually spent reading the file and not on the computations, plus I ran into some GC problems. I timed Files.lines(Path).toArray() against a method using a MappedByteBuffer (read-only, 0, file.size). The results were actually quite the same, with MappedByteBuffer a bit slower here and there. – user3001 Sep 12 '14 at 01:45
  • @user3001, what was your final resolution here? – daydreamer May 07 '15 at 19:12
  • Actually I found out that it largely depends on what you do with the data in the stream. In the end I used a sequential stream to avoid problems with garbage collection. – user3001 May 07 '15 at 19:49