If I use a delimiter on a string:

Scanner scanString = new Scanner(line).useDelimiter("<.*>");

I want to know why this won't preserve the text in

<a href="https://post.craigslist.org/c/snj?lang=en">post to classifieds</a>

but it will in a line with only

<option value="ccc">community

While

Scanner scanString = new Scanner(line).useDelimiter("<.*?>");

will work for both.

As I understand it, "<.*>" should match a string starting with "<", followed by any character zero or more times, until it reaches a ">". So shouldn't it stop there and not start excluding again until it reaches another "<"?

Michael Petrotta
John Powers
  • use a tool like Expresso: http://www.ultrapico.com/Expresso.htm – Mitch Wheat Feb 11 '12 at 04:45
  • You may also wish to read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Dawood ibn Kareem Feb 11 '12 at 05:59

1 Answer

This is because the second expression uses a reluctant (as opposed to greedy) quantifier, which means that it does not attempt to match the entire string and back off from there, like the first one does.
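
To see the difference on the line from the question, here is a minimal sketch using java.util.regex directly. The class name QuantifierDemo and the helper firstMatch are made up for illustration; only Pattern, Matcher, find() and group() come from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns the first substring of input that the
    // regex matches, or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<a href=\"https://post.craigslist.org/c/snj?lang=en\">post to classifieds</a>";

        // Greedy: the match runs from the first '<' to the last '>' on the line.
        System.out.println(firstMatch("<.*>", line));

        // Reluctant: the match ends at the first '>' after the opening '<'.
        System.out.println(firstMatch("<.*?>", line));
    }
}
```

The first call should print the entire line, the second only the opening <a ...> tag.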

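The greedy expression "<.*>" tries to advance as far as possible into your input string: the .* first consumes everything up to the end of the line, then backs off one character at a time until the trailing > can match. The match therefore ends at the last > on the line. In your first line the last > is the one that closes </a>, so the whole line is one big delimiter and no text is left over. In your second line there is only one >, so the match ends right after <option value="ccc"> and "community" survives. The reluctant version "<.*?>" does not do that: it expands only as far as it has to, so it matches to the first > and stops.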
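
The effect on Scanner can be checked with a small sketch like the one below. The class name DelimiterDemo and the helper tokens are assumptions made for this example; the two input lines and the two delimiters are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Hypothetical helper: collects every token the Scanner yields
    // when the given regex is used as the delimiter.
    static List<String> tokens(String line, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line).useDelimiter(delimiterRegex)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"https://post.craigslist.org/c/snj?lang=en\">post to classifieds</a>";
        String option = "<option value=\"ccc\">community";

        // Greedy delimiter: the anchor line is swallowed whole, the option line is not.
        System.out.println(tokens(anchor, "<.*>"));   // []
        System.out.println(tokens(option, "<.*>"));   // [community]

        // Reluctant delimiter: each tag is its own delimiter, so the text survives.
        System.out.println(tokens(anchor, "<.*?>"));  // [post to classifieds]
        System.out.println(tokens(option, "<.*?>"));  // [community]
    }
}
```

This only illustrates the quantifier behaviour; for real HTML a proper parser is the safer choice, as the comments on the question point out.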

This article provides a great read on quantifiers.

Sergey Kalinichenko