Understanding something in regular expressions

Question

If I use a delimiter on a string:

Scanner scanString = new Scanner(line).useDelimiter("<.*>");

I want to know why this won't preserve the text in

<a href="https://post.craigslist.org/c/snj?lang=en">post to classifieds</a>

but it will in a line with only

<option value="ccc">community

While

Scanner scanString = new Scanner(line).useDelimiter("<.*?>");

will work for both.

As I understand it, "<.*>" should exclude a string starting with "<", followed by any character zero or more times, until it reaches a ">". So shouldn't it stop excluding there and not start again until it reaches another "<"?

use a tool like Expresso: http://www.ultrapico.com/Expresso.htm — Mitch Wheat, Feb 11 '12 at 04:45
You may also wish to read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Dawood ibn Kareem, Feb 11 '12 at 05:59

Sergey Kalinichenko · Accepted Answer · 2012-02-11T05:01:17.603

This is because the second expression uses a reluctant (as opposed to greedy) quantifier, which means that it does not attempt to match the entire string and back off from there, like the first one does.

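The greedy expression "<.*>" lets .* consume as much as it can, so it runs all the way to the end of the line and then backs off one character at a time until it finds a ">". It therefore ends at the last ">" in the line, not the first. In your anchor line the last ">" is the one closing </a>, so the whole line becomes one big delimiter and the text in between is swallowed. In the option line the only ">" ends the opening tag, so "community" is left over as a token. The reluctant version "<.*?>" expands only as far as it has to: it stops at the first ">" after each "<".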
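The difference can be checked directly. Below is a minimal sketch, not taken from the original post: the class name DelimiterDemo and the helpers tokens and firstMatch are made up for illustration, and it uses only java.util.Scanner and java.util.regex. It runs both delimiters over the two lines from the question and prints what each delimiter matches first and which tokens the Scanner returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterDemo {

    // Hypothetical helper: collect every token the Scanner yields for this delimiter.
    static List<String> tokens(String line, String delimiter) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(line).useDelimiter(delimiter)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    // Hypothetical helper: the first substring the regex matches, or null if none.
    static String firstMatch(String line, String regex) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"https://post.craigslist.org/c/snj?lang=en\">post to classifieds</a>";
        String option = "<option value=\"ccc\">community";

        // Greedy: .* runs to the end of the line, then backs off to the LAST '>'.
        System.out.println(firstMatch(anchor, "<.*>"));  // the entire anchor line
        System.out.println(tokens(anchor, "<.*>"));      // []
        System.out.println(tokens(option, "<.*>"));      // [community]

        // Reluctant: .*? stops at the FIRST '>' after each '<'.
        System.out.println(firstMatch(anchor, "<.*?>")); // only the opening <a ...> tag
        System.out.println(tokens(anchor, "<.*?>"));     // [post to classifieds]
        System.out.println(tokens(option, "<.*?>"));     // [community]
    }
}
```

With the greedy delimiter the anchor line yields no tokens at all, because the single delimiter match covers the whole line; the option line still yields "community" only because its last ">" happens to come before the text.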

This article provides a great read on quantifiers.

edited Feb 11 '12 at 05:01

answered Feb 11 '12 at 04:50

Sergey Kalinichenko

Wow, that just made so much sense. And that article looks extremely helpful. Thank you! – John Powers Feb 11 '12 at 04:56
