
I have a big file (50-60 GB).

I also have a pretty nice machine (128 GB RAM and 16 cores).

Now, I want to read the entire file and do some operations on it. Please also note that the file is binary, so reading it as a string or as bytes doesn't matter to me. My I/O is very slow, so I thought of reading the entire file into BufferedReader's buffer.

But the BufferedReader constructor is letting me down.

It only allows a maximum buffer size of about 2G (the size argument is an `int`). This will be very painful for me, as I will have to go out to I/O about 30 times.

Looking at the methods of BufferedReader, none of them seems to exceed the 2G mark (not even the read function).
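
For reference, this is roughly what I am doing right now (the path below is just a placeholder); since the buffer-size argument is an `int`, I cannot ask for anything bigger than ~2G:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class BigFileRead {
    public static void main(String[] args) throws IOException {
        // The buffer size parameter is an int, so no matter how much RAM the
        // machine has, the buffer is bounded by Integer.MAX_VALUE (~2G).
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("/data/huge-file.bin")),
                1 << 30)) { // ~1G chars; going beyond ~2G is simply not possible
            String line;
            while ((line = reader.readLine()) != null) {
                // do some operations on each line
            }
        }
    }
}
```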

Am I looking at the wrong class?

Is there any other class in Java which suits my requirement?

My main requirement is that I can bear the initial load time, and I want to leverage the 128 GB of memory.

Thanks

Mohan
  • Have you actually measured how speed relates to buffer size? Try 128k, 1M, 8M, 64M, 512M, 2G. – user253751 Jun 03 '14 at 05:19
  • File reading is a painfully slow process. It has little to do with your RAM and everything to do with the speed of your disk hardware. A better paradigm would be to read fixed chunks of data from the file and let executors find the data in the chunks. This way you can leverage concurrency and also make your processing faster. The only thing you need to keep in mind is to make sure that the chunks are not so small that the executors finish processing before fresh chunks are made available, and the opposite too (see the first sketch after these comments). – Nazgul Jun 03 '14 at 05:24
  • @Nazgul That would be perfect. I have 128 GB of memory, so I thought instead of reading multiple times, I might as well read just once (I can bear the initial load time). – Mohan Jun 03 '14 at 05:34
  • You can't read binary with a `BufferedReader`, but in any case your I/O isn't 'very slow'. You can read millions of lines a second with a `BufferedReader`, and setting a huge buffer size isn't going to affect that materially. What is slow is your *processing.* If for example you are concatenating every line to a String you will bog down very quickly. Try to find a way to process a line at a time. – user207421 Jun 03 '14 at 05:50
  • Reading all of the data from the file in one go will eventually "store all your file data in memory". Also, like EJP said, you can't use `BufferedReader` to read binary files. The `Reader` classes of file I/O are meant for character-based streams; you should use `BufferedInputStream` instead. – Nazgul Jun 03 '14 at 06:27
  • @Nazgul I want to get the unique lines in a gzipped file. Since gzip preserves the '\n', I guess BufferedReader will work for my case (where I will store each line in a hash map). I am inclined towards BufferedReader since it provides a built-in readLine method (see the second sketch after these comments)... Please correct me if I am wrong – Mohan Jun 03 '14 at 16:03
  • As EJP said, readers are not meant for byte-stream-based I/O. GZIP is not a char-based stream or file. It's okay that it provides readLine, and that's a convenient method, but internally it also does a char-by-char match until it reads the newline char combination. You should try to design it around InputStream or DataInputStream if possible. – Nazgul Jun 04 '14 at 04:05
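
A minimal sketch of the chunk-plus-executor idea Nazgul describes above (the file path, chunk size, and `process` method are hypothetical; a real version would also need to handle records that straddle chunk boundaries and bound the number of in-flight chunks):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedProcessor {
    static final int CHUNK_SIZE = 64 * 1024 * 1024; // 64M chunks; tune to your disk/CPU balance

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(16); // one worker per core
        try (InputStream in = new BufferedInputStream(new FileInputStream("/data/huge-file.bin"))) {
            byte[] chunk = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.read(chunk)) != -1) {
                // Copy so the reader thread can refill 'chunk' while workers process the copy.
                byte[] toProcess = Arrays.copyOf(chunk, read);
                pool.submit(() -> process(toProcess));
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void process(byte[] data) {
        // do some operations on this chunk
    }
}
```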
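
And for the actual use case Mohan describes (distinct lines in a gzipped text file), a minimal sketch that decompresses as a character stream and deduplicates into a set might look like this (the path and charset are assumptions; note that the default BufferedReader buffer is plenty here):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class UniqueLines {
    public static void main(String[] args) throws IOException {
        Set<String> unique = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new GZIPInputStream(new FileInputStream("/data/huge-file.gz")),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                unique.add(line); // the set keeps only distinct lines
            }
        }
        System.out.println("Distinct lines: " + unique.size());
    }
}
```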

1 Answer


Going down the IO ladder 30 times is not going to hurt the performance of your program. The time it takes to read 2G from disk into RAM (many seconds, even on a beefy machine) completely dwarfs the cost of going in and out of native code, which is on the order of a couple of milliseconds, tops.

What are you doing that requires you having the entire file in memory? Can you not simply do serial processing?
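
For instance, a serial pass over the file can stay this simple (the path and buffer sizes below are just placeholders):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SerialProcessing {
    public static void main(String[] args) throws IOException {
        // An 8M buffer is already far past the point of diminishing returns;
        // the disk, not the buffer size, is the bottleneck.
        try (InputStream in = new BufferedInputStream(
                new FileInputStream("/data/huge-file.bin"), 8 * 1024 * 1024)) {
            byte[] block = new byte[1024 * 1024];
            int read;
            while ((read = in.read(block)) != -1) {
                // process 'read' bytes of 'block' here, then move on —
                // nothing is retained, so heap usage stays flat
            }
        }
    }
}
```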

torquestomp