Performance of Jsoup vs regexes vs XPath for extracting content from HTML?

Question

I know that in the general case HTML shouldn't be parsed with regexes.

But I want to run a performance test against a web application, and I know for sure what the HTML will look like, so I could use regexes to extract some data from the page source.

Since this is a performance test (using JMeter), I want to use as few resources as possible on the master machine.

Which option is the least resource-intensive: XPath, regexes (Jakarta ORO) or Jsoup?

score 3 · Accepted Answer · edited Dec 18 '12 at 07:57

As of JMeter 2.8, the answer is regexes, although it of course depends on the expressions you use. JMeter's regex implementation is fairly well optimized, and it is the main post-processing mechanism for correlation.
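
To illustrate the kind of extraction meant here, below is a minimal sketch for a page whose markup is known in advance. It uses java.util.regex as a stand-in rather than the Jakarta ORO engine that JMeter uses, and the hidden `token` field, the class name and the method name are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Compiled once and reused; recompiling for every response would waste CPU.
    // The pattern assumes the attribute order and quoting never change.
    private static final Pattern TOKEN = Pattern.compile(
            "<input type=\"hidden\" name=\"token\" value=\"([^\"]*)\"");

    // Returns the value of every matching hidden field, in document order.
    static List<String> extractTokens(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<form><input type=\"hidden\" name=\"token\" value=\"abc123\"></form>";
        System.out.println(extractTokens(html)); // prints [abc123]
    }
}
```

The scan is a single pass over the response text and builds no intermediate tree, which is why it tends to be cheap, but it breaks as soon as the markup deviates from the assumed shape.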

Jsoup would require custom code, for example in a JSR223 PostProcessor.

JMeter 2.9 will introduce a new CSS/jQuery selector-based extractor with two possible underlying implementations:

Jsoup
Jodd Lagarto (CSSelly)

See:

https://issues.apache.org/bugzilla/show_bug.cgi?id=54259

It will be slower than regexes because it builds a DOM document, but it makes the syntax much simpler for test plans that don't need to be ultra-optimized.

Finally, XPath also builds a DOM tree:

http://www.developer.com/xml/article.php/3397691/Does-StAX-Belong-in-Your-XML-Toolbox.htm

As a result, its memory and CPU cost is higher than that of regexes, particularly if you want to extract many elements. An enhancement has been filed for this:

https://issues.apache.org/bugzilla/show_bug.cgi?id=53973
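
For comparison, here is a rough sketch of the DOM-plus-XPath approach using only the JDK's javax.xml APIs. This is not JMeter's XPath Extractor code; the class name, method name and the `token` field are assumptions for illustration, and the input has to be well-formed XML (XHTML) for the parser to accept it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Parses the whole response into a DOM tree, then runs one XPath query on it.
    static List<String> extractTokens(String xhtml) {
        try {
            // The complete tree is built in memory before any query can run.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//input[@name='token']/@value", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<form><input type=\"hidden\" name=\"token\" value=\"abc123\"/></form>";
        System.out.println(extractTokens(xhtml)); // prints [abc123]
    }
}
```

Every element and attribute of the response becomes an object in the tree, whether or not the query touches it, which is where the extra memory and CPU go compared with a regex scan.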
