
I'm trying to read a file and collect the lines that share the same first token (readId) into a Set<String>. Each set is a value in my HashMap<String, Set<String>>.

I already increased my heap to 32 GB and moved from String.split to StringTokenizer, but I am still getting this error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.lang.String.substring(String.java:1913)
    at java.util.StringTokenizer.nextToken(StringTokenizer.java:352)
    at java.util.StringTokenizer.nextElement(StringTokenizer.java:407)
    at Simple1_BootStrap.createMapSet(Simple1_BootStrap.java:68)
    at Simple1_BootStrap.main(Simple1_BootStrap.java:206)

Previously, the "out of memory error" was generated by this line:

Set<String> s = new TreeSet<String>();

The piece of the code producing the error is:

Map<String, Set<String>> map2 = new HashMap<String, Set<String>>();

try {
    BufferedReader br = new BufferedReader(new FileReader(filename));

    String strLine;
    String readId;
    while ((strLine = br.readLine()) != null) {
        alignment++;
        StringTokenizer stringTokenizer = new StringTokenizer(strLine);

        readId = stringTokenizer.nextElement().toString();

        if (map2.containsKey(readId)) {
            Set<String> s = map2.get(readId);
            s.add(strLine);
            map2.put(readId, s);
        } else {
            Set<String> s = new TreeSet<String>();
            s.add(strLine);
            map2.put(readId, s);
        }
    }

    br.close();
} catch (Exception e) { // catch exception if any
    System.err.println("Error: " + e.getMessage());
}
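Incidentally, the loop above does up to three map lookups per line, and re-putting a set that was obtained via map2.get is unnecessary, since get returns a live reference. A runnable sketch of the same grouping with one lookup per line (the class name GroupLines, the group helper, and the sample lines are illustrative, not the original code):

```java
import java.util.*;

public class GroupLines {
    // Groups each line under its first token; one map lookup per line.
    public static Map<String, Set<String>> group(Iterable<String> lines) {
        Map<String, Set<String>> map = new HashMap<String, Set<String>>();
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line);
            if (!st.hasMoreTokens()) continue;   // skip blank lines
            String readId = st.nextToken();
            Set<String> s = map.get(readId);     // single lookup
            if (s == null) {
                s = new TreeSet<String>();
                map.put(readId, s);              // put only on first sighting
            }
            s.add(line);                         // no re-put needed afterwards
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r1 a", "r2 b", "r1 c");
        Map<String, Set<String>> m = group(lines);
        System.out.println(m.get("r1").size()); // prints 2
    }
}
```

This does not shrink the data itself, but it removes redundant hashing work on a file with hundreds of millions of lines.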

I put those lines inside a set because I need to randomly select entries in my hashmap and read the associated set to create a file similar to the input file.

Could somebody please suggest another approach that avoids the out-of-memory error?

Thank you.

DT7
    Do you really need it all in memory? For something so large I'd be more inclined to use an on disk key store, like the berkeley db. – FatalError Nov 14 '13 at 16:12
    Trying to load all of this into memory might not be the best approach. If we are ignoring that: How much physical RAM do you have in your computer? – reto Nov 14 '13 at 16:12
    Imo, don't use memory for storage of that magnitude. Consider a database. – Taylor Nov 14 '13 at 16:12
  • The server I use has about 64 GB. – user2992844 Nov 14 '13 at 16:13
  • FatalError and Taylor, you are right about a database; I am considering one too, but I am worried about (1) performance and (2) distribution, since this is part of an application I would like to distribute as an easy installation package. – user2992844 Nov 14 '13 at 16:17

2 Answers


Regardless of the wisdom of loading everything into memory, String.substring() holds a reference to the original (larger) string for versions of Java prior to recent builds of Java 7. As such you're probably holding on to a lot more memory than you imagine. See this question/answer for more details.

Using the String(String) constructor to build a new string from the StringTokenizer results will mitigate this, as will upgrading to a recent Java 7 runtime.
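For illustration, a minimal sketch of that defensive copy (the class name SubstringCopy, the firstToken helper, and the sample line are made up for this example; the copy only matters on JVMs before Java 7u6, where substring shared the parent string's backing char array):

```java
public class SubstringCopy {
    // Extracts the first whitespace-delimited token from a line.
    public static String firstToken(String line) {
        int sp = line.indexOf(' ');
        String token = (sp < 0) ? line : line.substring(0, sp);
        // On pre-7u6 JVMs, `token` would keep the entire backing array of
        // `line` alive; copying retains only the token's own characters.
        return new String(token);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("read42 ACGTACGT")); // prints read42
    }
}
```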

Brian Agnew
  • I like how you word it "Regardless of the wisdom..." +1. Another +1 would be for the Frank Zappa reference but I can only give 1. – Adam Arold Nov 14 '13 at 16:16
  • I was thinking the same thing, and was surprised to find this isn't the OP's case. You can see in the stacktrace for the OOME that substring is allocating a new array. I didn't know they finally changed that in the Oracle library. – Mark Peters Nov 14 '13 at 16:18

When you read a String, expect it to use 2-4x as much memory as it does on disk. Each character uses two bytes, and each String object plus its char[] adds about 80 bytes of overhead; e.g., a String of 4 characters uses about 88 bytes.

When you add this to a HashMap you need about 100 bytes for each record.
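Putting those figures together as a back-of-the-envelope estimate (the line count and average line length below are made-up placeholders; the per-byte constants are the rough ones from the answer above):

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long lines = 100_000_000L; // hypothetical: 100 million lines
        long avgChars = 100;       // hypothetical: 100 chars per line
        // ~80 bytes String+char[] overhead, 2 bytes per char,
        // ~100 bytes per HashMap/TreeSet entry
        long perLine = 80 + 2 * avgChars + 100;           // 380 bytes
        long totalGiB = lines * perLine / (1024L * 1024 * 1024);
        System.out.println(totalGiB + " GiB");            // roughly 35 GiB
    }
}
```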

In short, I would try a heap of at least 100 GB, assuming you have much more main memory than that.


A solution:

If you don't have this much memory I suggest you rethink your approach. E.g. you could memory-map the file so it is not on the heap at all, and use a Trove collection to refer to your data by index without using an object for the index.
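As a stdlib-only illustration of the "index, don't store" idea (not Trove itself; the class and method names here are invented for the sketch): keep only readIds and primitive byte offsets on the heap, and re-read the associated lines from disk on demand. This assumes a single-byte encoding such as ASCII, which RandomAccessFile.readLine handles.

```java
import java.io.*;
import java.util.*;

public class OffsetIndex {
    // Maps each readId to the byte offsets of its lines in the file,
    // so the line contents themselves never stay on the heap.
    public static Map<String, List<Long>> index(File f) throws IOException {
        Map<String, List<Long>> idx = new HashMap<String, List<Long>>();
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        long pos = raf.getFilePointer();
        String line;
        while ((line = raf.readLine()) != null) {
            int sp = line.indexOf(' ');
            // Copy the token so it does not pin the whole line (pre-7u6).
            String readId = new String(sp < 0 ? line : line.substring(0, sp));
            List<Long> offs = idx.get(readId);
            if (offs == null) {
                offs = new ArrayList<Long>();
                idx.put(readId, offs);
            }
            offs.add(pos);
            pos = raf.getFilePointer(); // offset where the next line starts
        }
        raf.close();
        return idx;
    }

    // Re-reads a single stored line on demand by its byte offset.
    public static String lineAt(File f, long offset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        raf.seek(offset);
        String line = raf.readLine();
        raf.close();
        return line;
    }
}
```

The heap then holds one String key per distinct readId plus boxed longs; with a primitive-collections library such as Trove, the offsets could be held as raw long values as well.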

Peter Lawrey
  • Thank you, Peter. Actually I have 256 GB of memory on the server, but hardly ever have 64 GB available at any one time. Could you please explain the use of a Trove collection, or perhaps send a pointer to a good resource? I appreciate your input a lot, because for the past two weeks I have just been looking at ways to overcome this problem. Thank you. – user2992844 Nov 14 '13 at 17:52
  • Trove (http://trove.starlight-systems.com/) allows you to have a key or value which is a primitive. This reduces the memory consumption. If the source text is not on the heap, this also reduces the memory consumption as the number of bytes is unchanged and uses very little heap. – Peter Lawrey Nov 14 '13 at 21:51