
I am implementing a clustering algorithm on a large dataset. The dataset is in a text file containing over 100 million records, and each record has three numeric fields.

1,1503895,4
3,2207774,5
6,2590061,3
...

I need to keep all this data in memory if possible, since my clustering algorithm needs to randomly access records in this file. Therefore I can't use the partition-and-merge approaches described in Find duplicates in large file.

What are possible solutions to this problem? Can I use caching techniques like ehcache?

  • Set up the VM with a lot of memory? Other than that... – SJuan76 Jan 26 '13 at 00:15
  • How large is the text file? Use data types that suit; it looks like a byte, int, byte. – exussum Jan 26 '13 at 00:20
  • I'm with @SJuan76. It sounds like your dataset is in the ~1-2GB range (representing each field as an int), which most any decent machine has. See http://stackoverflow.com/questions/2294268/how-can-i-increase-the-jvm-memory for how to set your JVM maximum heap size. – Nicu Stiurca Jan 26 '13 at 00:21
  • Alternatively, if some preprocessing is in order you may: a) make sure that all the records are the same length (in bytes) so you can use `RandomAccessFile` to read each record directly (I do not know how efficient that will be; it may depend on the FS), as sketched below; or b) partition the data into chunks of 100 records (or so), so that to read record 2050 you open file 20 and read its 50th record. – SJuan76 Jan 26 '13 at 00:26
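
In case keeping everything in memory turns out not to be an option, here is a minimal sketch of the fixed-length-record idea from the comment above. It assumes the text file has first been converted to a binary file in which every record is three 4-byte ints; the class name and layout are illustrative, not something given in the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: with fixed 12-byte records, record i starts at byte offset i * 12,
// so any record can be seeked to and read without loading the whole file.
public class RecordFile implements AutoCloseable {
    private static final int RECORD_SIZE = 3 * 4; // three ints, 12 bytes

    private final RandomAccessFile file;

    public RecordFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
    }

    /** Reads record i as an int[3] without loading the rest of the file. */
    public int[] read(long i) throws IOException {
        file.seek(i * RECORD_SIZE);
        return new int[] { file.readInt(), file.readInt(), file.readInt() };
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```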

1 Answer


300 million ints shouldn't consume that much memory. Try instantiating an array of 300 million ints; a back-of-the-envelope calculation puts that at about 1.2 GB on a 64-bit machine.
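
A minimal sketch of what that array-based layout might look like, assuming each of the three fields fits in an int; the record count, field names, and parser below are assumptions for illustration, not part of the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: store the three numeric fields of each record in parallel
// primitive int arrays. 3 fields x 100,000,000 records x 4 bytes/int
// is roughly 1.2 GB of raw data, so run with a heap such as -Xmx2g.
public class RecordStore {
    // Assumed record count; the question only says "over 100 million".
    static final int RECORD_COUNT = 100_000_000;

    final int[] f1 = new int[RECORD_COUNT];
    final int[] f2 = new int[RECORD_COUNT];
    final int[] f3 = new int[RECORD_COUNT];

    // Loads comma-separated lines like "1,1503895,4" from the text file.
    void load(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            for (int i = 0; (line = in.readLine()) != null && i < RECORD_COUNT; i++) {
                String[] parts = line.split(",");
                f1[i] = Integer.parseInt(parts[0]);
                f2[i] = Integer.parseInt(parts[1]);
                f3[i] = Integer.parseInt(parts[2]);
            }
        }
    }

    // Random access to record i is three O(1) array lookups.
    int[] record(int i) {
        return new int[] { f1[i], f2[i], f3[i] };
    }
}
```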

  • Thanks for the input. Earlier I was using a `HashMap` to store my Record objects. Now I have changed it to an `ArrayList` instead, and I am able to fit all my objects into memory. – ravindrab Jan 26 '13 at 01:05
  • Why should the CPU architecture (32 or 64 bits) matter when estimating how much memory you need for an array with 300 million ints? – jarnbjo Jan 26 '13 at 01:41
  • @jarnbjo Memory addresses take up more memory in a 64-bit environment. Depending on the VM there may or may not be padding between ints, but it holds true when using wrappers of primitive types. – Kyle Jan 26 '13 at 04:26
  • @AlanB For even more savings, at the expense of the extra complexity of managing the size of the array, try using an array of primitive ints: there is less overhead than storing pointers to wrappers. Also keep in mind that the `ArrayList`'s underlying array grows by roughly half its size when filled, which can cause a very large jump in memory consumption when the capacity is exceeded by only a few elements (see the sketch below). – Kyle Jan 26 '13 at 04:29
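
To illustrate the comment above, here is a small, hypothetical comparison of the two layouts (the element count and class name are made up for the example): the boxed version pays for a reference plus a separate `Integer` object per element, while the primitive array pays exactly 4 bytes per element.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: boxed Integers in an ArrayList cost far more per element
// than a primitive int[], and pre-sizing the list avoids copy-and-grow
// of its backing array.
public class BoxedVersusPrimitive {
    public static void main(String[] args) {
        int n = 1_000_000; // a smaller count, just for illustration

        // Boxed: each element is a reference (4-8 bytes) in the backing
        // array plus a separate Integer object (around 16 bytes on a
        // typical 64-bit JVM), several times the 4 bytes the value needs.
        List<Integer> boxed = new ArrayList<>(n); // pre-sized: no regrowth
        for (int i = 0; i < n; i++) {
            boxed.add(i);
        }

        // Primitive: exactly 4 bytes per element plus one array header.
        int[] primitive = new int[n];
        for (int i = 0; i < n; i++) {
            primitive[i] = i;
        }

        System.out.println(boxed.size() + " boxed, " + primitive.length + " primitive");
    }
}
```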