
There appears to be a memory leak when using the standard Java library (1.6.0_27) for evaluating XPath expressions.

See below for some code to reproduce this problem:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XpathTest {

    public static void main(String[] args) throws Exception {
        // Parse the source document into a DOM tree
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        docFactory.setNamespaceAware(true);
        DocumentBuilder builder = docFactory.newDocumentBuilder();
        Document doc = builder.parse("test.xml");

        XPathFactory factory = XPathFactory.newInstance();
        XPath xpath = factory.newXPath();
        XPathExpression expr = xpath.compile("//Product");

        // Select all Product elements
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            System.out.println(node.getAttributes().getNamedItem("id"));

            // Evaluate a second, relative expression against each Product
            XPathExpression testExpr = xpath.compile("Test");
            Object testResult = testExpr.evaluate(node, XPathConstants.NODE);
            Node test = (Node) testResult;
            System.out.println(test.getTextContent());
        }
        System.out.println(nodes.getLength());
    }
}

An example XML file is given below:

<Products>
  <Product id='ID0'>
    <Test>0</Test>
  </Product>
  <Product id='ID1'>
    <Test>1</Test>
  </Product>
  <Product id='ID2'>
    <Test>2</Test>
  </Product>
  <Product id='ID3'>
    <Test>3</Test>
  </Product>
  ...
</Products>

When I run this example under the NetBeans profiler, the number of allocations for the com.sun.org.apache.xpath.internal.objects.XObject class keeps increasing, even after garbage collection.

Am I using the XPath library in an incorrect way? Is this a bug in the Java libraries? Are there any potential workarounds?

Bob
  • Hmm, that would be quite interesting. How did you test your assumptions? With a profiler? How long is your sample XML file? It may well be that there is an internal cache to accelerate subsequent calls to `evaluate`... – Lukas Eder Sep 08 '11 at 07:13
  • The sample XML file has 100,000 records. I am using the NetBeans profiler, and the number of objects allocated for the type com.sun.org.apache.xpath.internal.objects.XObject is continually increasing as the file is parsed. – Bob Sep 08 '11 at 07:35
  • That is a lot of records. For performance (not only memory) reasons, you should avoid using XPath and prefer the DOM API where possible (see also [my benchmark here](http://stackoverflow.com/questions/6340802/java-xpath-apache-jaxp-implementation-performance)). – Lukas Eder Sep 08 '11 at 08:14
  • In the actual application I am using Stax based parsing. I just used DOM in the example to keep things simple. – Bob Sep 08 '11 at 08:27
  • That might change your whole problem setup. You should then post your actual implementation... – Lukas Eder Sep 08 '11 at 08:32

3 Answers


Don't know if this might be causing the memory leak, but:

XPathExpression testExpr = xpath.compile("Test");

Don't do this in the for loop. Compile it once outside the for loop and reuse it. Maybe the XPath object is caching all the expressions you are compiling for reuse?
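
A minimal sketch of that change, reusing the variable names from the question:

// Compile the relative expression once, before the loop
XPathExpression testExpr = xpath.compile("Test");
for (int i = 0; i < nodes.getLength(); i++) {
    Node node = nodes.item(i);
    System.out.println(node.getAttributes().getNamedItem("id"));
    // Reuse the precompiled expression against each Product node
    Node test = (Node) testExpr.evaluate(node, XPathConstants.NODE);
    System.out.println(test.getTextContent());
}

Note that XPathExpression is not thread-safe, but reusing it from a single thread like this is fine.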

prunge
  • That's certainly true, although I have found that `compile` only accounts for a very small share of CPU and memory consumption, compared to `XPathFactory.newInstance()` and `expr.evaluate()` (see these [benchmarks here](http://stackoverflow.com/questions/6340802/java-xpath-apache-jaxp-implementation-performance)) – Lukas Eder Sep 08 '11 at 07:16
  • I have already tried that, but no luck. The problem appears to be with the evaluate method. If I comment out the evaluate statement then there is no leak. – Bob Sep 08 '11 at 07:32

There is no "memory leak" in this case. A memory leak is a situation where an application cannot reclaim memory; here, all XObject (and XObject[]) instances can be reclaimed at some point in time.

A memory profiler snapshot obtained from VisualVM yields the following observations:

  • All XObject (and XObject[]) instances are created when the XPathExpression.evaluate method is invoked.
  • XObject instances are reclaimed when they are no longer reachable from a GC root. In your case, the GC roots are the result and testResult local variables which are local to the stack of the main thread.

Based on the above, I suppose that your application is experiencing, or is likely to experience, memory exhaustion as opposed to a memory leak. This happens when a large number of XObject/XObject[] instances produced by XPath expression evaluations have not been reclaimed by the garbage collector because

  • they are either still reachable from a GC root,
  • or the garbage collector hasn't come around to reclaiming them yet.

The only solution to the first is to retain objects in memory only for as long as they are required. You do not seem to be violating that in your code, but your code could certainly be made more efficient: you are retaining the result of the first XPath expression to be used by the second, when a single expression would do. //Product/Test can be used to retrieve the Test nodes and also to obtain the parent Product nodes' id values, as shown in the following snippet (which evaluates only one XPath expression instead of two):

// One expression selects the Test nodes; each Product id is
// reachable via getParentNode()
expr = xpath.compile("//Product/Test");
nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    Node node = nodes.item(i);
    System.out.println(node.getParentNode().getAttributes().getNamedItem("id"));
    System.out.println(node.getTextContent());
}
System.out.println(nodes.getLength());

As far as the second observation is concerned, you ought to obtain GC logs (using the -verbose:gc JVM startup flag). You could then decide to resize the young generation if too many short-lived objects are being created, since reachable objects may otherwise be promoted to the tenured generation, and a major collection would then be required to reclaim objects that are actually short-lived by nature. In an ideal scenario (considering your posted code), a young-generation collection cycle would occur every few iterations of the for loop, and the XObject instances local to the loop would be reclaimed as soon as the block's local variables go out of scope.
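
For illustration, such a run might be started like this (the 256m young generation size is only a hypothetical starting point, to be tuned against the GC logs):

java -verbose:gc -XX:+PrintGCDetails -Xmn256m XpathTest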

Vineet Reynolds
  • The program included was just a test program to reproduce a problem that I found in my application. In the application I actually need to process fragments which are stored in a database and extract attributes from these fragments using XPath expressions. There could be potentially millions of product records, and this will require millions of XPath expression evaluations. I can have a look into the GC suggestion, but I would have thought that the GC would be able to reclaim the memory if I left the application running for long enough. – Bob Sep 08 '11 at 08:25
  • @Bob, there are at least two kinds of GC cycles. If your short-lived objects can live beyond several young generation GC cycles, they will be promoted to the tenured generation once the young generation fills up. At that point in time, you need a major collection and not a minor collection to reclaim these objects. That's why you'll need to resize the young generation to be larger (I believe the default is 4M), so that there is a higher possibility that a young gen cycle (which occurs more frequently in this case) will find most objects to be unreachable from the GC roots. – Vineet Reynolds Sep 08 '11 at 08:33

You say: "the number of objects allocated for the type com.sun.org.apache.xpath.internal.objects.XObject is continually increasing as the file is parsed".

I think you will find that this is by design. I don't know the internals of the Apache tools, but you must expect a normal (non-streaming) DOM and XPath implementation to use an amount of memory that is proportional to the source document size.

So I would expect the memory requirement to increase as the source document is parsed. I wouldn't expect it to increase as more XPath expressions are executed against that document (after discounting the effect that some of the tree building is done lazily, the first time each node is accessed).
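
Since you mention elsewhere that the real application uses StAX, note for contrast that a streaming pass keeps memory use roughly constant regardless of document size. A minimal sketch, assuming the flat Products/Product/Test structure from your example (class and file names are placeholders):

import java.io.FileInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxProductReader {

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("test.xml"));
        String currentId = null; // id of the Product currently being read
        int productCount = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                if ("Product".equals(reader.getLocalName())) {
                    currentId = reader.getAttributeValue(null, "id");
                    productCount++;
                } else if ("Test".equals(reader.getLocalName())) {
                    // getElementText() consumes the element's text content
                    System.out.println(currentId + ": " + reader.getElementText());
                }
            }
        }
        reader.close();
        System.out.println(productCount);
    }
}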

Michael Kay