In Java, I have a 335 GB file that contains one number per line. I need to read it line by line, as if it were a stream of numbers - I must not keep all the data in memory. I was told that the Scanner class will not work. Could you please recommend the best possible way to do that?

Danny Brown
- Use `BufferedReader`. – Luiggi Mendoza Feb 06 '15 at 17:14
- 335 GB? That's a huge one... can you post some sample data so we can tailor our solution to the data format? – Arkantos Feb 06 '15 at 17:22
- Why won't Scanner work? BTW I assume you mean 335 GB = gigabytes rather than Gb = gigabits. – Peter Lawrey Feb 06 '15 at 17:43
- If you are reading a large file exactly once from beginning to end, it doesn't matter which method you use; they will all end up with roughly the same performance. You can't outsmart the hard drive. – Holger Feb 06 '15 at 17:48
- `Scanner` sounds just fine, to be honest. – Louis Wasserman Feb 06 '15 at 17:54
2 Answers
None of the java.io input stream classes will "keep all the data in memory". You are free to choose whatever suits you best, such as BufferedReader or DataInputStream.
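A minimal sketch of that line-by-line approach, assuming a plain-text file with one integer per line; the file name `numbers.txt` and the running sum are just placeholders for whatever processing you actually need:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SumNumbers {
    public static void main(String[] args) throws IOException {
        long sum = 0;                        // example aggregate; replace with your own processing
        // Only the current line plus BufferedReader's internal buffer is held in memory,
        // so the total file size does not matter.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("numbers.txt"), StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;                // skip blank lines, if any
                }
                sum += Long.parseLong(line.trim());
            }
        }
        System.out.println("sum = " + sum);
    }
}
```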

Joe
If you use BufferedReader you should be able to get up to 90 MB/s in one thread.
You can use tricks such as breaking up the file and reading portions of the data concurrently, but this will only help if your disk's read throughput is high enough.
For example, you can memory-map the whole 335 GB at once without using much heap. This works even if you have only a fraction of that amount of main memory.
What read transfer rate can you get from your disk subsystem?
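A rough sketch of that chunked, memory-mapped approach, assuming Unix `\n` line endings, one integer per line, and no line longer than 64 bytes; the file name `numbers.txt`, the chunk size, and the summing step are placeholders:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MappedParallelSum {
    private static final long CHUNK = 256L * 1024 * 1024; // ~256 MB mapped per task
    private static final int MAX_LINE = 64;               // assumed upper bound on line length

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("numbers.txt");              // placeholder file name
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();

            // Pick split points roughly CHUNK bytes apart, then nudge each one forward to the
            // byte after the next '\n' so every chunk contains only complete lines.
            List<Long> splits = new ArrayList<>();
            splits.add(0L);
            for (long pos = CHUNK; pos < size; pos += CHUNK) {
                MappedByteBuffer probe =
                        ch.map(FileChannel.MapMode.READ_ONLY, pos, Math.min(MAX_LINE, size - pos));
                int i = 0;
                while (i < probe.limit() && probe.get(i) != '\n') i++;
                long adjusted = pos + i + 1;
                if (adjusted < size) splits.add(adjusted);
            }
            splits.add(size);

            // Map each chunk and parse it in its own task. The mappings live outside the Java
            // heap; the OS pages the data in and out as the buffers are read.
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            List<Future<Long>> parts = new ArrayList<>();
            for (int k = 0; k + 1 < splits.size(); k++) {
                long start = splits.get(k);
                long length = splits.get(k + 1) - start;
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, start, length);
                Callable<Long> task = () -> sumChunk(buf);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            pool.shutdown();
            System.out.println("sum = " + total);
        }
    }

    // Parses one (optionally negative) integer per '\n'-terminated line and sums them.
    private static long sumChunk(MappedByteBuffer buf) {
        long sum = 0;
        int p = 0, limit = buf.limit();
        while (p < limit) {
            boolean negative = buf.get(p) == '-';
            if (negative) p++;
            long value = 0;
            while (p < limit) {
                byte b = buf.get(p++);
                if (b == '\n') break;
                if (b >= '0' && b <= '9') value = value * 10 + (b - '0');
            }
            sum += negative ? -value : value;
        }
        return sum;
    }
}
```

Each `MappedByteBuffer` lives outside the Java heap, so mapping chunks of a 335 GB file does not require anywhere near that much RAM; whether the extra threads actually help depends on how much spare read throughput your disk subsystem has.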

Peter Lawrey
- Why the reference to the hard number of 90 MB/s? My system surely allows more; others may be slower. I doubt that any trick will accelerate a task as simple as described. – Holger Feb 06 '15 at 17:52
- @Holger The 90 MB/s is for a typical fast processor. If there is spare read throughput capacity, using memory-mapped files and multiple threads can help you reach your maximum read throughput. E.g. I have exceeded 1.2 GB/s using an SSD and memory-mapped files. – Peter Lawrey Feb 06 '15 at 17:55
- Multi-threading is unlikely to accelerate I/O that goes serially through one bus. If you are talking about 1.2 GB/s, then parsing the numbers in parallel might indeed improve the throughput, but that actually proves that on your system the I/O is *not* the bottleneck. So I don't believe that the same system only allows 90 MB/s when using `BufferedReader`… – Holger Feb 06 '15 at 18:02