I am working on a project that evaluates web pages at a large scale (millions of pages). Unlike a typical web crawler, my evaluator needs to build a JDOM object for each page it evaluates in order to run XPath queries against it.

Our current implementation uses SAXBuilder to instantiate JDOM objects (some are then cached for potential future use) so that we can query XPaths against them. But simple profiling shows that instantiation actually consumes most of the time, far more than evaluating the XPaths themselves, so I am looking for alternative solutions. Is there any way to reduce this overhead, for example by:

  1. Creating "lean" JDOM objects with only minimal structural information of the page?
  2. Evaluating XPaths without an actual JDOM object?
  3. Initialising JDOM objects faster by re-using the object for a "similar" page?
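To illustrate option 2, this is the kind of thing I have in mind. It is only a sketch using the JDK's javax.xml.xpath API, not something we have working, and as far as I know the built-in implementation may still build a DOM internally, so I don't know whether it would actually be faster:

import java.io.File;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: hand the file straight to the XPath engine, with no explicit
// JDOM (or other) tree built in our own code. The JDK implementation may
// still construct a DOM internally, so this shows the API shape, not a
// measured saving.
public static NodeList evaluateWithoutJdom(File html, String pattern) throws XPathExpressionException {
    XPath xpath = XPathFactory.newInstance().newXPath();
    InputSource source = new InputSource(html.toURI().toASCIIString());
    return (NodeList) xpath.evaluate(pattern, source, XPathConstants.NODESET);
}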

EDIT: We are using JDOM 2.x.

A sample of the way we initialize a JDOM object and run an XPath query against it:

// saxBuilder and xpathFactory are static members, reused across invocations
public static List<Element> evaluateHTML(File html, String pattern) throws JDOMException, IOException {
    Element page = saxBuilder.build(html).getRootElement();
    XPathBuilder<Element> xpath = new XPathBuilder<>(pattern, Filters.element());
    // ...set xpath namespaces...
    XPathExpression<Element> expr = xpath.compileWith(xpathFactory);
    return expr.evaluate(page);
}

Here saxBuilder and xpathFactory are static members, and evaluateHTML is invoked once for every HTML file we evaluate.
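One thing I have considered on the XPath side is caching the compiled expressions per pattern instead of rebuilding them on every call, roughly like the sketch below (exprCache is just an illustrative name, and I have not verified whether JDOM's XPathExpression is safe to share across threads). I realize this only trims the XPath half, not the parsing that dominates our profile:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

// Sketch: cache compiled XPath expressions keyed by pattern so repeated
// evaluations of the same pattern skip recompilation. exprCache is an
// illustrative name; thread-safety of the cached expressions is unverified.
private static final Map<String, XPathExpression<Element>> exprCache = new ConcurrentHashMap<>();

private static XPathExpression<Element> compiled(String pattern) {
    return exprCache.computeIfAbsent(pattern,
            p -> XPathFactory.instance().compile(p, Filters.element()));
}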

– eladidan
  • +1 Good question. Can you show us a bit of code showing how you instantiate the JDOM objects? That will help narrow down the possible answers. – LarsH Jul 28 '14 at 18:55
  • Also, are you using JDOM 1.x or 2.x? See http://stackoverflow.com/a/16316589/423105 for example. – LarsH Jul 28 '14 at 18:59
  • See also the two answers under http://stackoverflow.com/q/10116891/423105 – LarsH Jul 28 '14 at 19:00
