5

I'm running a simple scanner to parse a string, however I've discovered that if called often enough I get OutOfMemory errors. This code is called as part of the constructor of an object that is built repeatedly for an array of strings :

Edit: Here's the constructor for more infos; not much more happening outside of the try-catch regarding the Scanner

   public Header(String headerText) {
        char[] charArr;
        charArr = headerText.toCharArray();
        // Check that all characters are printable characters
        if (charArr.length > 0 && !commonMethods.isPrint(charArr)) {
            throw new IllegalArgumentException(headerText);
        }
        // Check for header suffix
        Scanner sc = new Scanner(headerText);
        MatchResult res;
        try {
            sc.findInLine("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
            res = sc.match();
        } finally {
            sc.close();
        }

        if (res.group(1) == null || res.group(1).isEmpty()) {
            throw new IllegalArgumentException("Missing header keyword found");     // Empty header to store
        } else {
            mnemonic = res.group(1).toLowerCase();                            // Store header
        }
        if (res.group(2) == null || res.group(2).isEmpty()) {
            suffix = -1;
        } else {
            try {
                suffix = Integer.parseInt(res.group(2));       // Store suffix if it exists
            }  catch (NumberFormatException e) {
                throw new NumberFormatException(headerText);
            }
        }
        if (res.group(3) == null || res.group(3).isEmpty()) {
            isQuery= false;
        } else {
            if (res.group(3).equals("?")) {
                isQuery = true;
            } else {
                throw new IllegalArgumentException(headerText);
            }
        }

        // If command was of the form *ABC, reject suffixes and prefixes
        if (mnemonic.contains("*") 
                && suffix != -1) {
            throw new IllegalArgumentException(headerText);
        }
    }

A profiler memory snapshot shows the read(Char) method of Scanner.findInLine() to be allocated massive amounts of memory during operation as a I scan through a few hundred thousands strings; after a few seconds it already is allocated over 38MB.

enter image description here

I would think that calling close() on the scanner after using it in the constructor would flag the old object to be cleared by the GC, but somehow it remains and the read method accumulates gigabytes of data before filling the heap.

Can anybody point me in the right direction?

darkhelmet
  • 123
  • 1
  • 2
  • 7
  • 1
    What happens to the `MatchResult` after this code completes? – Russell Zahniser Mar 21 '13 at 17:39
  • Are you sure the memory won't be freed when the GC does happen? I would guess that memory is available to be GC'd, but the GC doesn't happen when you expect it to - rather, after some threshold is reached, the next memory allocation will cause the GC to occur, and result in a lot of freed memory... – Rob I Mar 21 '13 at 17:40
  • 2
    @RobI: He's getting an `OutOfMemoryException`, and the JVM guarantees that it will run garbage collection before it will throw OOM. – Russell Zahniser Mar 21 '13 at 17:43
  • `sc.findInLine("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");` This line suppose to return String. Most probably your program is getting hanged at this line every time you run. – Smit Mar 21 '13 at 17:45
  • @RussellZahniser is `OutOfMemoryError` – Luiggi Mendoza Mar 21 '13 at 17:47
  • @RusselZhaniser MatchResult res is processed for groups and then the constructor ends; – darkhelmet Mar 21 '13 at 17:48
  • @Smit findInLine correctly returns a string – darkhelmet Mar 21 '13 at 17:50
  • @darkhelmet Regex get matched to entered line through console. If its gets matched then it return strings. If not then most probably it will get hang and code will not run further. Is this happening to your program? Can you show your whole code? – Smit Mar 21 '13 at 17:53

6 Answers6

2

You haven't posted all your code, but given that you are scanning for the same regex repeatedly, it would be much more efficient to compile a static Pattern beforehand and use this for the scanner's find:

static Pattern p = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");

and in the constructor:

sc.findInLine(p);

This may or may not be the source of the OOM issue, but it will definitely make your parsing a bit faster.

Related: java.util.regex - importance of Pattern.compile()?

Update: after you posted more of your code, I see some other issues. If you're calling this constructor repeatedly, it means you are probably tokenizing or breaking up the input beforehand. Why create a new Scanner to parse each line? They are expensive; you should be using the same Scanner to parse the entire file, if possible. Using one Scanner with a precompiled Pattern will be much faster than what you are doing now, which is creating a new Scanner and a new Pattern for each line you are parsing.

Community
  • 1
  • 1
Andrew Mao
  • 35,740
  • 23
  • 143
  • 224
  • 1
    These are good suggestions, but none of the problems you point out here would cause a memory leak. – Russell Zahniser Mar 21 '13 at 18:10
  • Yeah, they reason the tokenizers are split is because the first one does not know how to parse the content of a line. The user input is split into substrings of commands which are then individualy built a around that substring, only the command object knows how to properly read it. – darkhelmet Mar 21 '13 at 20:18
1

The strings that are filling up your memory were created in findInLine(). Therefore, the repeated Pattern creation is not the problem.

Without knowing what the rest of the code does, my guess would be that one of the groups you get out of the matcher is being kept in a field of your object. Then that string would have been allocated in findInLine(), as you see here, but the fact that it is being retained would be due to your code.

Edit:

Here's your problem:

mnemonic = res.group(1).toLowerCase();

What you might not realize is that toLowerCase() returns this if there are no uppercase letters in the string. Also, group(int) returns a substring(), which creates a new string backed by the same char[] as the full string. So, mnemonic actually contains the char[] for the entire line.

The fix would just be:

mnemonic = new String(res.group(1).toLowerCase());
Russell Zahniser
  • 16,188
  • 39
  • 30
  • 1
    You are most likely right. But please note that the behavior of `String.substring()` has been changed by Java7u6 (see [Bug #4513622](http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622)). The substring is now copied by default rather than pointing on the char buffer of the original string. – Clément MATHIEU Mar 21 '13 at 18:48
  • And @darkhelmet could use a jvisualvm heap dump to verify where in his code the references still exist. – Rob I Mar 21 '13 at 21:29
  • Thanks @RobI, I was able to trace back the problem through the references; it was to do with a list holding the object constructed. – darkhelmet Mar 22 '13 at 17:41
0

I think that your code snippet is not full. I believe you are calling scanner.findInLine() in loop. Anyway, try to call scanner.reset(). I hope this will solve your problem.

AlexR
  • 114,158
  • 16
  • 130
  • 208
0

The JVM apparently does not have time to Garbage collect. Possibly because it's using the same code (the constructor) repeatedly to create multiple instances of the same class. The JVM may not do anything about GC until something changes on the run time stack -- and in this case that's not happening. I've been warned in the past about doing "too much" in a constructor as some of the memory management behaviors are not quite the same when other methods are being called.

nosmadar
  • 11
  • 1
0

Your problem is that you are scanning through a couple hundred thousand strings and you are passing the pattern in as a string, so you have a new pattern object for every single iteration of the loop. You can pull the pattern out of the loop, like so:

    Pattern toMatch = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)")

    Scanner sc = new Scanner(headerText);
    MatchResult res;

    try {
        sc.findInLine(toMatch);
        res = sc.match();
    } finally {
        sc.close();
    }

Then you will only be passing the object reference to toMatch instead of having the overhead of creating a new pattern object for every attempt at a match. This will fix your leak.

CaTalyst.X
  • 1,645
  • 13
  • 16
  • This is a performance improvement, but is absolutely not related to the memory leak. – Clément MATHIEU Mar 21 '13 at 18:37
  • @AndrewMao, I opened this up before work and didnt finish the answer til after. Sorry about that. – CaTalyst.X Mar 22 '13 at 16:27
  • @ClémentMATHIEU, I believe this IS the memory leak. That loop is generating hundreds of thousands of pattern objects before the GC has a chance to run. If you are going to contradict my answer at least provide some evidence. – CaTalyst.X Mar 22 '13 at 16:28
  • @AndrewMao: By definition if the objects are no longer referenced and will eventually be garbaged this is not a memory leak. It is a performance issue. The OP is looking for a **memory leak**, and says that objects survive to GC collection so your answer is definitely wrong. – Clément MATHIEU Mar 22 '13 at 16:40
0

Well I've found the source of the problem, it wasn't Scanner exactly but the list holding the objects doing the scanning in the constructor.

The problem had to do with the overrun of a list that was holding references to the object containing the parsing, essentially more strings were received per unit of time than could be processed and the list grew and grew until there were no more RAM. Bounding this list to a maximum size now prevents the parser from overloading the memory; I'll be adding some synchronization between the parser and the data source to avoid this overrun in the future.

Thank you all for your suggestions, I've already made some changes performance wise regarding the scanner and thank you to @RobI for pointing me to jvisualvm which allowed me to trace back the exact culprits holding the references. The memory dump wasn't showing the reference linking.

darkhelmet
  • 123
  • 1
  • 2
  • 7