I am implementing a clustering algorithm on a large dataset. The dataset is a text file with over 100 million records, each containing 3 numeric fields, for example:
1,1503895,4
3,2207774,5
6,2590061,3
...
I need to keep all this data in memory if possible, because my clustering algorithm requires random access to the records. Therefore I can't use partition-and-merge approaches like those described in Find duplicates in large file.
What are possible solutions to this problem? Can I use caching techniques like ehcache?
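For context, this is roughly what I had in mind if I try to hold everything on the heap myself (just a minimal sketch; the file name, record count, and the assumption that every field fits in an int are my own): parallel primitive arrays give O(1) random access at about 12 bytes per record, so 100 million records would be on the order of 1.2 GB.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RecordLoader {
    // Parallel primitive arrays: 3 ints per record -> ~12 bytes/record,
    // so 100 million records is roughly 1.2 GB of heap (plus array overhead).
    static int[] a, b, c;

    static void load(String path, int expectedRecords) throws IOException {
        a = new int[expectedRecords];
        b = new int[expectedRecords];
        c = new int[expectedRecords];
        int n = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null && n < expectedRecords) {
                String[] f = line.split(",");
                a[n] = Integer.parseInt(f[0].trim());
                b[n] = Integer.parseInt(f[1].trim());
                c[n] = Integer.parseInt(f[2].trim());
                n++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        load("records.txt", 100_000_000); // hypothetical file name and record count
        // Random access to record i is then just a[i], b[i], c[i].
    }
}

With a large enough -Xmx this might fit, but I'm not sure whether holding raw arrays like this is the right approach compared to a caching library.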