I am working on a project that evaluates web pages on a large scale (millions). Unlike a typical web crawler, my evaluator needs to build a JDOM object for each page it evaluates in order to run XPath queries on it.
Our current implementation uses SAXBuilder to instantiate the JDOM objects (some are then cached for potential future use) before querying them with XPath. However, simple profiling shows that building the JDOM objects consumes far more time than evaluating the XPath queries, so I am looking for alternative solutions. Is there any way to reduce this overhead, for example by:
- Creating "lean" JDOM objects that hold only minimal structural information about the page? (A rough sketch of what I mean follows this list.)
- Evaluating XPaths without building an actual JDOM object?
- Initialising JDOM objects faster by re-using an object built for a "similar" page?
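To illustrate the first bullet, here is a rough, untested sketch of what a "lean" build might look like: an org.xml.sax.helpers.XMLFilterImpl installed on the SAXBuilder that swallows character data before it reaches JDOM, so the resulting tree holds only elements and attributes. The class name is made up, and this only works if our XPath patterns never touch text() nodes:

import java.io.File;
import java.io.IOException;

import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
import org.xml.sax.helpers.XMLFilterImpl;

public class LeanBuilder {
    private static final SAXBuilder leanBuilder = new SAXBuilder();

    static {
        // Drop all character data during the SAX phase; structural events
        // (elements, attributes) pass through to JDOM untouched.
        leanBuilder.setXMLFilter(new XMLFilterImpl() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Intentionally empty: no Text nodes are ever created.
            }
        });
    }

    public static Document buildLean(File html) throws JDOMException, IOException {
        return leanBuilder.build(html);
    }
}

I am not sure how much of the build time goes into text nodes versus raw parsing, so this may or may not help in practice.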
EDIT:
We are using JDOM 2.x.
A sample of the way we initialize a JDOM object:
public static List<Element> evaluateHTML(File html, String pattern) throws JDOMException, IOException {
    Element page = saxBuilder.build(html).getRootElement();
    XPathBuilder<Element> xpath = new XPathBuilder<>(pattern, Filters.element());
    // ...set xpath namespaces...
    XPathExpression<Element> expr = xpath.compileWith(xpathFactory);
    return expr.evaluate(page);
}
Where xpathFactory is a static member and evaluateHTML is invoked for every HTML file we evaluate.
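For completeness, here is a sketch (again untested, class name made up) of how the compilation step could be hoisted out of the hot path by caching compiled expressions per pattern. JDOM's XPathExpression objects are reusable but not thread-safe, hence the per-thread cache. Note this only removes the repeated compile, not the dominant build cost the profiler points at:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

public class CachedEvaluator {
    private static final SAXBuilder saxBuilder = new SAXBuilder();
    private static final XPathFactory xpathFactory = XPathFactory.instance();

    // One cache per thread, since compiled expressions are not thread-safe.
    private static final ThreadLocal<Map<String, XPathExpression<Element>>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    public static List<Element> evaluateHTML(File html, String pattern)
            throws JDOMException, IOException {
        Element page = saxBuilder.build(html).getRootElement();
        // Compile each distinct pattern once per thread, then reuse it.
        XPathExpression<Element> expr = CACHE.get().computeIfAbsent(pattern, p -> {
            XPathBuilder<Element> xpath = new XPathBuilder<>(p, Filters.element());
            // ...set xpath namespaces...
            return xpath.compileWith(xpathFactory);
        });
        return expr.evaluate(page);
    }
}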