In Java, I have a 335 GB file that contains one number per line. I need to read it line by line, as if it were a stream of numbers - I must not keep all the data in memory. I was told that the Scanner class will not work. Could you please recommend the best possible way to do that?

Danny Brown
- Use `BufferedReader`. – Luiggi Mendoza Feb 06 '15 at 17:14
- 335 GB? That's a huge one... can you post some sample data so we can tailor our solution to the data format? – Arkantos Feb 06 '15 at 17:22
- Why won't Scanner work? BTW I assume you mean 335 GB = gigabytes rather than Gb = gigabits. – Peter Lawrey Feb 06 '15 at 17:43
- If you are reading a large file exactly once from beginning to end, it doesn't matter which method you use; they will all end up with roughly the same performance. You can't outsmart the hard drive. – Holger Feb 06 '15 at 17:48
- `Scanner` sounds just fine, to be honest. – Louis Wasserman Feb 06 '15 at 17:54
2 Answers
None of the java.io input stream classes will "keep all the data in memory". You are free to choose whatever suits you best, such as BufferedReader or DataInputStream.
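A minimal sketch of that line-by-line approach, assuming a plain-text file with one integer per line; the file name `numbers.txt` and the running sum are just placeholders for whatever processing you actually need:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SumNumbers {
    public static void main(String[] args) throws IOException {
        long sum = 0;                        // example aggregate; replace with your own processing
        // Only the current line plus BufferedReader's internal buffer is held in memory,
        // so the total file size does not matter.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("numbers.txt"), StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;                // skip blank lines, if any
                }
                sum += Long.parseLong(line.trim());
            }
        }
        System.out.println("sum = " + sum);
    }
}
```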

Joe
If you use BufferedReader you should be able to get up to 90 MB/s in one thread.
You can use tricks such as breaking up the file and reading portions of the data concurrently, but this will only help if your disk's read throughput is high enough.
For example, you can memory-map the whole 335 GB at once without using much heap. This works even if you have only a fraction of that amount of main memory.
What read transfer rate can you get from your disk subsystem?
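A rough sketch of that chunked, memory-mapped approach, assuming Unix `\n` line endings, one integer per line, and no line longer than 64 bytes; the file name `numbers.txt`, the chunk size, and the summing step are placeholders:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MappedParallelSum {
    private static final long CHUNK = 256L * 1024 * 1024; // ~256 MB mapped per task
    private static final int MAX_LINE = 64;               // assumed upper bound on line length

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("numbers.txt");              // placeholder file name
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();

            // Pick split points roughly CHUNK bytes apart, then nudge each one forward to the
            // byte after the next '\n' so every chunk contains only complete lines.
            List<Long> splits = new ArrayList<>();
            splits.add(0L);
            for (long pos = CHUNK; pos < size; pos += CHUNK) {
                MappedByteBuffer probe =
                        ch.map(FileChannel.MapMode.READ_ONLY, pos, Math.min(MAX_LINE, size - pos));
                int i = 0;
                while (i < probe.limit() && probe.get(i) != '\n') i++;
                long adjusted = pos + i + 1;
                if (adjusted < size) splits.add(adjusted);
            }
            splits.add(size);

            // Map each chunk and parse it in its own task. The mappings live outside the Java
            // heap; the OS pages the data in and out as the buffers are read.
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            List<Future<Long>> parts = new ArrayList<>();
            for (int k = 0; k + 1 < splits.size(); k++) {
                long start = splits.get(k);
                long length = splits.get(k + 1) - start;
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, start, length);
                Callable<Long> task = () -> sumChunk(buf);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            pool.shutdown();
            System.out.println("sum = " + total);
        }
    }

    // Parses one (optionally negative) integer per '\n'-terminated line and sums them.
    private static long sumChunk(MappedByteBuffer buf) {
        long sum = 0;
        int p = 0, limit = buf.limit();
        while (p < limit) {
            boolean negative = buf.get(p) == '-';
            if (negative) p++;
            long value = 0;
            while (p < limit) {
                byte b = buf.get(p++);
                if (b == '\n') break;
                if (b >= '0' && b <= '9') value = value * 10 + (b - '0');
            }
            sum += negative ? -value : value;
        }
        return sum;
    }
}
```

Each `MappedByteBuffer` lives outside the Java heap, so mapping chunks of a 335 GB file does not require anywhere near that much RAM; whether the extra threads actually help depends on how much spare read throughput your disk subsystem has.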

Peter Lawrey
- Why the reference to the hard number of 90 MB/s? My system surely allows more; others may be slower. I doubt that any trick will accelerate a task as simple as described. – Holger Feb 06 '15 at 17:52
- @Holger The 90 MB/s is for a typical fast processor. If there is spare read throughput capacity, using memory-mapped files and multiple threads can help you reach your maximum read throughput. E.g. I have exceeded 1.2 GB/s using an SSD and memory-mapped files. – Peter Lawrey Feb 06 '15 at 17:55
- Multi-threading is unlikely to accelerate I/O that goes serially through one bus. If you are talking about 1.2 GB/s, then parsing the numbers in parallel might indeed improve the throughput, but that actually proves that on your system the I/O is *not* the bottleneck. So I don't believe that the same system only allows 90 MB/s when using `BufferedReader`… – Holger Feb 06 '15 at 18:02