3

I found out that the memory my program uses keeps increasing because of the code below. I am reading a file that is about 7GB, and I believe the data that actually ends up in the HashSet is less than 10MB, but my program's memory usage keeps growing to 300MB and then it crashes with an OutOfMemoryError. If the HashSet is the problem, which data structure should I choose instead?

    if (tagsStr != null) {
        // remember the ids of posts tagged "a", "b" or "c"
        if (tagsStr.contains("a") || tagsStr.contains("b") || tagsStr.contains("c")) {
            maTable.add(postId);
        }
    } else {
        if (maTable.contains(parentId)) {
            // do something else; no memory is added here
        }
    }
faz
  • 313
  • 5
  • 12
  • 2
    I think it is unlikely a HashSet problem unless you are putting a lot of data in it. What is the size of the strings you are storing? Are you reading the entire file into memory or one line at a time? The data you have provided here does not really give enough information to help. – John B Nov 07 '11 at 15:02
  • How many items does your table contain before crashing? – Steven Jeuris Nov 07 '11 at 15:02
  • 1
    And what is the average length / size of the elements? – John B Nov 07 '11 at 15:03
  • look at http://www.javaspecialists.eu/archive/Issue193.html – Alpedar Nov 07 '11 at 15:04
  • I just followed the size of the HashSet; when it crashes, it has 86,000 String elements in it. Are they the reason for the memory failure? – faz Nov 07 '11 at 22:10

4 Answers

7

You haven't really told us what you're doing, but:

  • If your file is currently in something like ASCII, each character you read takes one byte in the file but two bytes in memory.
  • Each string will have an object overhead - this can be significant if you're storing lots of small strings
  • If you're reading lines with BufferedReader (or taking substrings from large strings), each one may have a large backing buffer - you may want to use maTable.add(new String(postId)) to avoid this (see the sketch after this list)
  • Each entry in the hash set needs a separate object to keep the key/hashcode/value/next-entry values. Again, with a lot of entries this can add up

In short, it's quite possible that you're doing nothing wrong, but a combination of memory-increasing factors is working against you. Most of these are unavoidable, but the third one may be relevant.
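As a hedged sketch of that third point - the file name, the substring offsets and the class name are placeholders, and this reflects the pre-Java-7u6 behaviour (current at the time) where substring() shared its parent's backing char[]:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: the explicit new String(...) copies just the characters
    // you keep, so the full line's large backing array can be collected.
    public class TrimmedCopySketch {
        public static void main(String[] args) throws IOException {
            Set<String> maTable = new HashSet<String>();
            BufferedReader br = new BufferedReader(new FileReader("posts.xml"));
            String line;
            while ((line = br.readLine()) != null) {
                if (line.length() < 18) continue;       // placeholder guard
                String postId = line.substring(10, 18); // shares line's char[]
                maTable.add(new String(postId));        // trimmed copy
            }
            br.close();
            System.out.println(maTable.size() + " ids kept");
        }
    }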

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • Are you sure about your third point - "reading lines with BufferedReader"? I thought BufferedReader took care to use new String(...) itself. I agree with the bit about substrings. – Paul Cager Nov 07 '11 at 15:28
  • @PaulCager: It certainly didn't last time I looked: it read into a buffer char array (defaulting to 80 chars IIRC), and then created a new `String` which is a view onto that char array. If the array was much bigger than the "usable string" then you can waste a lot of memory. This was a while ago, so it may have changed. – Jon Skeet Nov 07 '11 at 15:33
  • It looks like it has changed - it will now either return a String built via a StringBuffer (if the line overflows cb), or do a "str = new String(cb, startChar, i - startChar);" where cb is the buffer. – Paul Cager Nov 07 '11 at 15:41
  • @PaulCager: Right - but in both cases that could end up with a larger backing buffer than the logical string. – Jon Skeet Nov 07 '11 at 15:47
4

You've either got a memory leak or your understanding of the amount of string data that you are storing is incorrect. We can't tell which without seeing more of your code.

The scientific solution is to run your application using a memory profiler, and analyze the output to see which of your data structures is using an unexpectedly large amount of memory.


If I were to guess, it would be that your application (at some level) is doing something like this:

    String line;
    while ((line = br.readLine()) != null) {
        // search for tag in line
        String tagStr = line.substring(pos1, pos2);
        // code as per your example
    }

This uses a lot more memory than you'd expect. The substring(...) call creates a tagStr String that refers to the backing array of the original line string. The tag strings that you expect to be short therefore each hold a reference to a char[] containing all the characters of the original line.

The fix is to do this:

    String tagStr = new String(line.substring(pos1, pos2));

This creates a String object that does not share the backing array of the argument String.

UPDATE - this or something similar is an increasingly likely explanation ... given your latest data.


To expand on another of Jon Skeet's points, the overheads of a small String are surprisingly high. For instance, on a typical 32-bit JVM, the memory usage of a one-character String is:

  • String object header: 2 words
  • String object fields: 3 words
  • Padding: 1 word (I think)
  • Backing array object header: 3 words
  • Backing array data: 1 word

Total: 10 words - 40 bytes - to hold one char of data ... or one byte of data if your input is in an 8-bit character set.

(This is not sufficient to explain your problem, but you should be aware of it anyway.)
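A rough, hedged way to sanity-check these numbers (the 86,000 figure comes from the comment on the question, the class name is made up, and heap readings are only indicative because GC timing varies):

    import java.util.HashSet;
    import java.util.Set;

    // Back-of-the-envelope measurement, not a profiler: populate a set with
    // short, defensively copied strings and report approximate heap growth.
    public class StringOverheadSketch {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            Set<String> table = new HashSet<String>();
            for (int i = 0; i < 86000; i++) {
                table.add(new String("post" + i)); // copy: no shared backing array
            }

            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.println("approx " + ((after - before) / 86000) + " bytes per entry");
        }
    }

Even at ~100 bytes per entry (string plus HashSet entry overhead), 86,000 short strings come to well under 10MB, nowhere near 300MB - which is why shared backing arrays are the more plausible culprit.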

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • I'd like to add that in general it is also possible that sharing of backing arrays between Strings leads to reduced memory consumption. This depends on how many strings share a backing array, and which part of the backing array isn't used by any of the Strings. – jmg Nov 08 '11 at 09:09
  • It is theoretically possible, but it seems unlikely in the OP's case. – Stephen C Nov 08 '11 at 10:01
0

Couldn't it be that the data read into memory (from the 7GB file) is somehow not freed? Something like Jon describes: since strings are immutable, every string read requires a new String object to be created, which might lead to running out of memory if GC is not quick enough.

If the above is the case, then you might insert some 'breakpoints' into your code/iteration, i.e. at some defined points, request a GC and wait until it completes - see the sketch below.
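A minimal sketch of the idea - br is assumed to be a BufferedReader over the input file, and the interval is arbitrary. As the comments below point out, the JVM already runs a full collection before throwing OutOfMemoryError, so this is mainly useful for observing memory behaviour, not for preventing the error:

    int count = 0;
    String line;
    while ((line = br.readLine()) != null) {
        // ... process the line as in the question ...
        if (++count % 100000 == 0) {
            System.gc(); // requests (does not force) a collection at a defined point
        }
    }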

Gyula
  • 54
  • 2
  • You will not get an OOM if GC isn't quick enough. If necessary GC will pause the whole VM and do a stop-the-world collection before throwing OOM. – Paul Cager Nov 07 '11 at 15:31
  • Thank you for your response. In fact I was unaware of the fact that GC is triggered automatically in a blocking mode if necessary :) However it seems that even with this method an OOM might still occur. See the related question: http://stackoverflow.com/questions/1393486/what-does-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee and specifically the official article it references: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#par_gc.oom – Gyula Nov 08 '11 at 13:44
  • "The parallel collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown." That is if GC has too much to work (too much object too free) an OOM might be thrown. Hence in such cases application-driven GC might help, might not? – Gyula Nov 08 '11 at 13:52
0

Run your program with -XX:+HeapDumpOnOutOfMemoryError. You'll then be able to use a memory analyser like MAT to see what is using up all of the memory - it may be something completely unexpected.
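For example (the dump path and the jar name are placeholders; -XX:HeapDumpPath is optional and defaults to the working directory):

    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps -jar myapp.jar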

Paul Cager
  • 1,910
  • 14
  • 21
  • Thanks, I tried using MAT, but the dump kept failing like this: "Dumping heap to java_pid4080.hprof ... Dump file is incomplete: Not enough space". Is there any way to solve this? – faz Nov 07 '11 at 20:40
  • 1
    It sounds like you have run out of disk space. – Paul Cager Nov 07 '11 at 23:15