5

I would like to read a huge binary file (~100GB) efficiently in Java. I have to process each line of it; the line processing will be done in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What would be the optimum buffer size? Is there a formula for that?

Vivek
  • 341
  • 1
  • 5
  • 15
  • this might be what you want to do http://stackoverflow.com/questions/11110153/java-reading-file-chunk-by-chunk – XtremeBaumer Dec 05 '16 at 11:03
  • 8
    what is a *line* in a *binary file*? – Timothy Truckle Dec 05 '16 at 11:03
  • Possible duplicate of [How do I read and write to a file using threads in java?](http://stackoverflow.com/questions/4701691/how-do-i-read-and-write-to-a-file-using-threads-in-java) – Timothy Truckle Dec 05 '16 at 11:04
  • `BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("file_to_read"), "UTF8")); int lines = 0; while (reader.readLine() != null) { lines++; } reader.close();` Sorry can't make this code look nice but it works – Tchopane Dec 05 '16 at 11:05
  • is it really a binary file or a text file to process? – Nicolas Filotto Dec 05 '16 at 11:06
  • 2
    _I don't want to load the whole file into memory._ Don't worry, you will not be able to ;) And you will be limited by the storage reading rate here, hope you are on an SSD. – AxelH Dec 05 '16 at 11:07

2 Answers

4

If this is a binary file, then reading in "lines" does not make a lot of sense.

If the file really is binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process. (A sketch follows the tips below.)

And repeat.

Tips:

  • Use a bounded buffer in case you can read lines faster than you can process them.
  • Recycle the byte[] objects to reduce garbage generation.
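
Here is a minimal sketch of that approach. Assumptions for illustration: '\n' is the end-of-line byte, the file is named big.dat, and a single worker stands in for a pool of threads; error handling is omitted.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkedLineReader {

    // A "line" is the byte array plus the count of valid bytes in it.
    static final class Line {
        final byte[] bytes; final int length;
        Line(byte[] bytes, int length) { this.bytes = bytes; this.length = length; }
    }

    static final Line POISON = new Line(new byte[0], -1);  // stop signal

    public static void main(String[] args) throws Exception {
        // Bounded queue: the reader blocks if the workers fall behind.
        BlockingQueue<Line> queue = new ArrayBlockingQueue<>(1024);

        // One worker for illustration; start several in a real program.
        Thread worker = new Thread(() -> {
            try {
                Line line;
                while ((line = queue.take()) != POISON) {
                    process(line.bytes, line.length);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try (BufferedInputStream in =
                new BufferedInputStream(new FileInputStream("big.dat"))) {
            byte[] buf = new byte[8192];
            int len = 0, b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {                      // end-of-"line" marker
                    queue.put(new Line(Arrays.copyOf(buf, len), len));
                    len = 0;
                } else {
                    if (len == buf.length) {          // grow for long lines
                        buf = Arrays.copyOf(buf, buf.length * 2);
                    }
                    buf[len++] = (byte) b;
                }
            }
            if (len > 0) {                            // trailing unterminated line
                queue.put(new Line(Arrays.copyOf(buf, len), len));
            }
        }
        queue.put(POISON);
        worker.join();
    }

    static void process(byte[] bytes, int length) {
        // ... per-line processing goes here ...
    }
}

To apply the recycling tip, you would draw the byte[] from a pool instead of calling Arrays.copyOf for every line.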

If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
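
For the text case, a minimal sketch (the file name and UTF-8 charset are assumptions):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TextLineReader {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("big.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // hand each line off to the worker queue here
            }
        }
    }
}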


The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point in optimizing the file reading. You can check this by profiling.

If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated, but potentially faster than read() or readLine().
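
By way of illustration, a minimal sketch of chunked reading with FileChannel and a direct ByteBuffer; the 64 KB size and the big.dat name are assumptions:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioChunkReader {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                Paths.get("big.dat"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
            while (channel.read(buffer) != -1) {
                buffer.flip();                  // switch from writing to reading
                while (buffer.hasRemaining()) {
                    byte b = buffer.get();
                    // scan for line terminators and hand chunks off here
                }
                buffer.clear();                 // make room for the next read
            }
        }
    }
}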


Does reading in chunks work?

BufferedReader and BufferedInputStream both read in chunks under the covers.

What will be the optimum buffer size?

The buffer size probably isn't that important. I'd make it a few KB or a few tens of KB.
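
If you do want to experiment, both classes accept an explicit buffer size; the 32 KB below is just an example value:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;

public class BufferSizes {
    public static void main(String[] args) throws IOException {
        // Explicit buffer sizes (32 KB here, an arbitrary choice):
        BufferedInputStream in =
            new BufferedInputStream(new FileInputStream("big.dat"), 32 * 1024);
        BufferedReader reader =
            new BufferedReader(new FileReader("big.txt"), 32 * 1024);
        in.close();
        reader.close();
    }
}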

Any formula for that?

No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • The file is a mainframe one which has IBM-encoded data. This file is converted to binary format, which means the data contains certain symbols like ¥€ etc. It is stored in a Windows folder as a txt file. So one can say it is text. Sorry for the confusion. – Vivek Dec 05 '16 at 11:42
  • This is a very good, thorough answer with pro tips. Thank you. This is the same strategy that I ended up using. Initially I tried to parallelize the process using a bunch of threads coordinated by an ExecutorService and stowing the file data inside memory-mapped buffers; it worked, and it cut the processing time down by an order of 5, but it also consumed a ton of memory, so I fell back to using a BufferedReader. (FWIW, I was trying to write what was basically a multi-threaded grep in Java.) – Aquarelle Feb 15 '23 at 22:02
0

Java 8, streaming

try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}
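
Files.lines reads the file lazily, so the whole file is never held in memory at once; wrapping the stream in try-with-resources, as above, ensures the underlying file handle is closed.
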
Jay
  • 9,189
  • 12
  • 56
  • 96