I want to read all lines of a 1 GB file as fast as possible into a `Stream<String>`. Currently I'm using `Files.lines(path)` for that. After parsing the file, I'm doing some computations (`map()`/`filter()`).
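For reference, the pipeline looks roughly like this (the actual `map`/`filter` steps are more involved; the ones here are placeholders):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LineProcessing {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("file");
        // Files.lines() streams the file lazily, line by line.
        try (Stream<String> lines = Files.lines(path)) {
            Object[] result = lines
                    .map(String::trim)          // placeholder for the real parsing
                    .filter(s -> !s.isEmpty())  // placeholder for the real filtering
                    .toArray();
            System.out.println(result.length + " elements");
        }
    }
}
```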
At first I thought this was already done in parallel, but it seems I was wrong: reading the file as-is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and process the parts in parallel, it only takes about 30 seconds.
I tried the following combinations:

- single file, no parallel `lines()` stream: ~50 seconds
- single file, `Files.lines(..).parallel().[...]`: ~50 seconds
- two files, no parallel `lines()` stream: ~30 seconds
- two files, `Files.lines(..).parallel().[...]`: ~30 seconds
I ran these four combinations multiple times with roughly the same results (within 1 or 2 seconds). The `[...]` is a chain of `map` and `filter` calls only, with a `toArray(...)` at the end to trigger the evaluation.
My conclusion is that using `lines().parallel()` makes no difference. Since reading two files in parallel takes less time, splitting the file does yield a performance gain; the whole file, however, appears to be read serially.
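For the two-file runs, the processing can be expressed roughly like this (a sketch, not my exact harness: `xaa`/`xab` are `split`'s default output names, and the `map`/`filter` steps are again placeholders):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TwoFileVariant {
    public static void main(String[] args) {
        // A parallel stream over the two halves: each half can be read on
        // its own fork-join task, and flatMap closes each inner stream.
        Object[] result = Stream.of(Paths.get("xaa"), Paths.get("xab"))
                .parallel()
                .flatMap(TwoFileVariant::lines)
                .map(String::trim)          // placeholder for the real parsing
                .filter(s -> !s.isEmpty())  // placeholder for the real filtering
                .toArray();
        System.out.println(result.length + " elements");
    }

    private static Stream<String> lines(Path p) {
        try {
            return Files.lines(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```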
Edit:
I want to point out that I use an SSD, so there is practically no seek time. The file has 1658652 (relatively short) lines in total.
Splitting the file in bash takes about 1.5 seconds of CPU time (9.5 seconds wall-clock):

```
time split -l 829326 file   # 829326 = 1658652 / 2
split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total
```
So my question is: is there any class or function in the Java 8 JDK that can parallelize reading all the lines without me having to split the file first? For example, with two CPU cores, the first line reader would start at the first line, and a second one at line `(totalLines/2)+1`.
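To make this concrete, here is a sketch of the behavior I have in mind (nothing like `readRange` exists in the JDK as far as I know; it and `process` are hypothetical helpers, and `RandomAccessFile.readLine()` is only used to keep the sketch short, since it is slow and assumes ISO-8859-1):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

public class SplitRead {
    public static void main(String[] args) throws Exception {
        long size = new File("file").length();
        long mid = size / 2;

        // Two readers over the same file: one per half, split at the
        // byte midpoint rather than at a line count.
        Thread first  = new Thread(() -> readRange("file", 0, mid));
        Thread second = new Thread(() -> readRange("file", mid, size));
        first.start();
        second.start();
        first.join();
        second.join();
    }

    // Processes every line that *starts* before 'to'. A reader starting
    // mid-file skips its (usually partial) first line, because the
    // previous range reads past 'to' to finish that line.
    static void readRange(String file, long from, long to) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(from);
            if (from > 0) {
                // Caveat: if 'from' lands exactly on a line start, this
                // drops a whole line; a robust version would inspect the
                // byte at from-1 first.
                raf.readLine();
            }
            while (raf.getFilePointer() < to) {
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                process(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void process(String line) {
        // placeholder for the map/filter computations
    }
}
```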