
I'm trying to read a file and collect the lines that share the same first token (readId) into a Set<String>. Each set is a value in my HashMap<String, Set<String>>.

I already increased my heap to 32 GB and moved from String.split to StringTokenizer, but I am still getting this error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.lang.String.substring(String.java:1913)
    at java.util.StringTokenizer.nextToken(StringTokenizer.java:352)
    at java.util.StringTokenizer.nextElement(StringTokenizer.java:407)
    at Simple1_BootStrap.createMapSet(Simple1_BootStrap.java:68)
    at Simple1_BootStrap.main(Simple1_BootStrap.java:206)

Previously, the "out of memory error" was generated by this line:

Set<String> s = new TreeSet<String>();

The piece of the code producing the error is:

Map<String, Set<String>> map2 = new HashMap<String, Set<String>>();

try {
    BufferedReader br = new BufferedReader(new FileReader(filename));

    String strLine;
    String readId;
    while ((strLine = br.readLine()) != null) {
        alignment++;
        StringTokenizer stringTokenizer = new StringTokenizer(strLine);

        readId = stringTokenizer.nextElement().toString();

        if (map2.containsKey(readId)) {
            Set<String> s = map2.get(readId);
            s.add(strLine);
            map2.put(readId, s);
        } else {
            Set<String> s = new TreeSet<String>();
            s.add(strLine);
            map2.put(readId, s);
        }
    }

    br.close();
} catch (Exception e) { // catch exception if any
    System.err.println("Error: " + e.getMessage());
}
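Incidentally, the loop above does up to three map lookups per line, and re-putting a set that was obtained via map2.get is unnecessary, since get returns a live reference. A runnable sketch of the same grouping with one lookup per line (the class name GroupLines, the group helper, and the sample lines are illustrative, not the original code):

```java
import java.util.*;

public class GroupLines {
    // Groups each line under its first token; one map lookup per line.
    public static Map<String, Set<String>> group(Iterable<String> lines) {
        Map<String, Set<String>> map = new HashMap<String, Set<String>>();
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line);
            if (!st.hasMoreTokens()) continue;   // skip blank lines
            String readId = st.nextToken();
            Set<String> s = map.get(readId);     // single lookup
            if (s == null) {
                s = new TreeSet<String>();
                map.put(readId, s);              // put only on first sighting
            }
            s.add(line);                         // no re-put needed afterwards
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r1 a", "r2 b", "r1 c");
        Map<String, Set<String>> m = group(lines);
        System.out.println(m.get("r1").size()); // prints 2
    }
}
```

This does not shrink the data itself, but it removes redundant hashing work on a file with hundreds of millions of lines.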

I put those lines inside a set because I need to randomly select entries in my hashmap and read the associated set to create a file similar to the input file.

Could somebody please suggest another approach that avoids the out-of-memory error?

Thank you.

DT7
    Do you really need it all in memory? For something so large I'd be more inclined to use an on disk key store, like the berkeley db. – FatalError Nov 14 '13 at 16:12
    Trying to load all of this into memory might not be the best approach. If we are ignoring that: How much physical RAM do you have in your computer? – reto Nov 14 '13 at 16:12
    Imo, don't use memory for storage of that magnitude. Consider a database. – Taylor Nov 14 '13 at 16:12
  • The server I use has about 64 GB. – user2992844 Nov 14 '13 at 16:13
  • FatalError and Taylor, you are right about a database; I am considering one too, but I am worried about (1) performance and (2) distribution, since this is part of an application I would like to distribute as an easy installation package. – user2992844 Nov 14 '13 at 16:17

2 Answers


Regardless of the wisdom of loading everything into memory, String.substring() holds a reference to the original (larger) string for versions of Java prior to recent builds of Java 7. As such you're probably holding on to a lot more memory than you imagine. See this question/answer for more details.

Using the String(String) constructor to build a new string from the StringTokenizer results will mitigate this, as will upgrading to a recent Java 7 runtime.
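For illustration, a minimal sketch of that defensive copy (the class name SubstringCopy, the firstToken helper, and the sample line are made up for this example; the copy only matters on JVMs before Java 7u6, where substring shared the parent string's backing char array):

```java
public class SubstringCopy {
    // Extracts the first whitespace-delimited token from a line.
    public static String firstToken(String line) {
        int sp = line.indexOf(' ');
        String token = (sp < 0) ? line : line.substring(0, sp);
        // On pre-7u6 JVMs, `token` would keep the entire backing array of
        // `line` alive; copying retains only the token's own characters.
        return new String(token);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("read42 ACGTACGT")); // prints read42
    }
}
```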

Brian Agnew
  • I like how you word it "Regardless of the wisdom..." +1. Another +1 would be for the Frank Zappa reference but I can only give 1. – Adam Arold Nov 14 '13 at 16:16
  • I was thinking the same thing, and was surprised to find this isn't the OP's case. You can see in the stacktrace for the OOME that substring is allocating a new array. I didn't know they finally changed that in the Oracle library. – Mark Peters Nov 14 '13 at 16:18

When you read a String, expect it to use 2-4x as much memory as it does on disk. Each character uses two bytes, and each String object plus its char[] adds about 80 bytes of overhead; e.g., a String of 4 characters uses about 88 bytes.

When you add this to a HashMap you need about 100 bytes for each record.
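Putting those figures together as a back-of-the-envelope estimate (the line count and average line length below are made-up placeholders; the per-byte constants are the rough ones from the answer above):

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long lines = 100_000_000L; // hypothetical: 100 million lines
        long avgChars = 100;       // hypothetical: 100 chars per line
        // ~80 bytes String+char[] overhead, 2 bytes per char,
        // ~100 bytes per HashMap/TreeSet entry
        long perLine = 80 + 2 * avgChars + 100;           // 380 bytes
        long totalGiB = lines * perLine / (1024L * 1024 * 1024);
        System.out.println(totalGiB + " GiB");            // roughly 35 GiB
    }
}
```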

In short, I would try a heap of at least 100 GB, assuming you have much more main memory than that.


A solution:

If you don't have this much memory I suggest you rethink your approach. E.g. you could memory-map the file so it is not on the heap at all, and use a Trove collection to refer to your data by index without using an object for the index.
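As a stdlib-only illustration of the "index, don't store" idea (not Trove itself; the class and method names here are invented for the sketch): keep only readIds and primitive byte offsets on the heap, and re-read the associated lines from disk on demand. This assumes a single-byte encoding such as ASCII, which RandomAccessFile.readLine handles.

```java
import java.io.*;
import java.util.*;

public class OffsetIndex {
    // Maps each readId to the byte offsets of its lines in the file,
    // so the line contents themselves never stay on the heap.
    public static Map<String, List<Long>> index(File f) throws IOException {
        Map<String, List<Long>> idx = new HashMap<String, List<Long>>();
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        long pos = raf.getFilePointer();
        String line;
        while ((line = raf.readLine()) != null) {
            int sp = line.indexOf(' ');
            // Copy the token so it does not pin the whole line (pre-7u6).
            String readId = new String(sp < 0 ? line : line.substring(0, sp));
            List<Long> offs = idx.get(readId);
            if (offs == null) {
                offs = new ArrayList<Long>();
                idx.put(readId, offs);
            }
            offs.add(pos);
            pos = raf.getFilePointer(); // offset where the next line starts
        }
        raf.close();
        return idx;
    }

    // Re-reads a single stored line on demand by its byte offset.
    public static String lineAt(File f, long offset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        raf.seek(offset);
        String line = raf.readLine();
        raf.close();
        return line;
    }
}
```

The heap then holds one String key per distinct readId plus boxed longs; with a primitive-collections library such as Trove, the offsets could be held as raw long values as well.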

Peter Lawrey
  • Thank you, Peter. Actually I have 256 GB of memory on the server, but hardly ever have 64 GB available at any one time. Could you please explain the use of a Trove collection, or perhaps send a pointer to a good resource? I appreciate your input a lot, because for the past two weeks I have just been looking at ways to overcome this problem. Thank you. – user2992844 Nov 14 '13 at 17:52
  • Trove (http://trove.starlight-systems.com/) allows you to have a key or value which is a primitive. This reduces the memory consumption. If the source text is not on the heap, this also reduces the memory consumption as the number of bytes is unchanged and uses very little heap. – Peter Lawrey Nov 14 '13 at 21:51