
I need to parse my HTML page to replace some links. A link has this form: `<a href="/{localLink:1144}" title="Bas-rhin">Mauris nec</a>`. The problem is that my regex doesn't stop matching where I expect it to; I think it's because of the `"`.

This is my regex:

Regex r= new Regex("<a href=\"(/{localLink:)(.*)}\" title=\"(.*)\">(.*)</a>");

The match doesn't end after each link, and the third group contains not just the title attribute but almost all of the HTML up to the end of my page.

I tested it with this site:

http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

So, why doesn't the third group end directly after `Bas-rhin"`?

Stephane Mathis
  • ...particularly the first answer... – David M Aug 12 '13 at 15:40
  • I think this question is relevant, do check the marked answer: [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – Geeky Guy Aug 12 '13 at 15:41
  • @stephane without testing or reading your regex: replace `.*` with `.*?`, it will make it ungreedy. Also don't forget to escape `{}` – HamZa Aug 12 '13 at 15:50
  • are you sure you haven't confounded the text boxes when testing ? your pattern works fine with me on derekslager, using the option '`CultureInvariant`'. – collapsar Aug 12 '13 at 16:12

3 Answers

1

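The answer to the question you asked ("So, why doesn't the third group end directly after Bas-Rhin"?") is that `.*` is greedy: it consumes as much as possible and only backtracks as far as needed for the rest of the pattern to match, so it ends at the last possible `"` rather than the first. Replace it with `.*?` to make it consume as little as possible.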
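To see the difference, here is a minimal sketch in C# that runs a greedy and a lazy version of the pattern against two links on the same line. The second link and the variable names are made up for illustration, and the braces are escaped to keep them unambiguous.

```csharp
using System;
using System.Text.RegularExpressions;

// Two links on one line; the first is from the question, the second is invented.
string html =
    "<a href=\"/{localLink:1144}\" title=\"Bas-rhin\">Mauris nec</a> and " +
    "<a href=\"/{localLink:1145}\" title=\"Haut-rhin\">Nullam</a>";

// Greedy: each .* runs to the end of the line, then backtracks to the LAST place
// where the rest of the pattern can still match.
var greedy = new Regex("<a href=\"(/\\{localLink:)(.*)\\}\" title=\"(.*)\">(.*)</a>");

// Lazy: each .*? stops at the FIRST place where the rest of the pattern can match.
var lazy = new Regex("<a href=\"(/\\{localLink:)(.*?)\\}\" title=\"(.*?)\">(.*?)</a>");

MatchCollection greedyMatches = greedy.Matches(html);
MatchCollection lazyMatches = lazy.Matches(html);

Console.WriteLine(greedyMatches.Count);             // 1 (one match swallows both links)
Console.WriteLine(lazyMatches.Count);               // 2 (one match per link)
Console.WriteLine(lazyMatches[0].Groups[3].Value);  // Bas-rhin
```

Note that `.` does not match a newline by default, so the greedy version only misbehaves when several links share a line.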

The answer to many questions you're probably going to have if you keep going down this path is that regular expressions cannot correctly parse HTML, as HTML is not a regular language. If you have a language with nested matching tokens (such as <tag> matched with </tag> or { matched with }) and no limit to nesting depth (which is the case in HTML, C-family languages, JSON, and many others), regular expressions simply will not work to parse or validate it.

Eric Finn
  • this answer doesn't seem to be correct since the pattern sports a literal `"` ending the match-all subpatterns for the values of `href`, `title` attributes, thus there will be no unwanted greedy matching. – collapsar Aug 12 '13 at 16:11
  • @collapsar Actually, if there is any other tag with a `"` right before the `>`, and there is a `</a>` anywhere after that, it'll consume until it matches that. So the pattern would match all of `Text More text here. google` – Eric Finn Aug 12 '13 at 16:16
  • this is correct, of course, but doesn't apply to the example case given by the OP. best practice would of course be to force ungreedy matching or limit the permissible characters for the attribute values (ie. using `[^"]*` instead of `.*`). – collapsar Aug 12 '13 at 16:24
1
Regex r= new Regex("<a href=\"(/{localLink:)(.*)}\" title=\"(.*)\">(.*)</a>");

This doesn't work as expected because quantifiers (such as `*`) are greedy by default: they match as much as they possibly can.

There are several ways to solve the problem:

1 the most obvious:

Make your quantifiers lazy by adding a question mark: `(.*?)`

2 the most efficient:

Don't use the dot; use a negated character class instead, so the match cannot run past the delimiter. Example:

Regex r= new Regex("<a href=\"(/{localLink:)([^}]*)}\" title=\"([^\"]*)\">(.*?)</a>");

The last (.*?) can be replaced by:

((?>[^<]+|<(?!/a>))*)
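Since the goal is to replace the links, here is a hedged sketch of how the negated-class pattern could be combined with `Regex.Replace` and a match evaluator. The `resolveUrl` lookup and the `/page/` URL scheme are hypothetical placeholders, not something from the question, and the capture groups are simplified to id, title and link text.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input: the first link is from the question, the second is invented.
string html =
    "<a href=\"/{localLink:1144}\" title=\"Bas-rhin\">Mauris nec</a> and " +
    "<a href=\"/{localLink:1145}\" title=\"Haut-rhin\">Nullam</a>";

// [^}]* cannot run past the closing brace, and [^"]* cannot run past the closing quote.
// Group 1 = id, group 2 = title, group 3 = link text.
var linkRegex = new Regex(
    "<a href=\"/\\{localLink:([^}]*)\\}\" title=\"([^\"]*)\">(.*?)</a>");

// Hypothetical lookup that maps a localLink id to a real URL (assumption for this sketch).
Func<string, string> resolveUrl = id => "/page/" + id;

// Rebuild each matched link with the resolved href, keeping title and text unchanged.
string result = linkRegex.Replace(html, m =>
    "<a href=\"" + resolveUrl(m.Groups[1].Value) + "\"" +
    " title=\"" + m.Groups[2].Value + "\">" +
    m.Groups[3].Value + "</a>");

Console.WriteLine(result);
```

This still assumes every link has exactly the attribute order and spacing shown in the question; anything else is silently left untouched.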

3 the most reasonable:

Use Html Agility Pack or another HTML parser to extract all `a` tags. You can then check whether each href has the form you want. (Note that with XPath you can perform this check directly in one step.)

XPath query example:

//a[contains(@href, '{localLink:')]
Casimir et Hippolyte
0

Your test case appears to be fine:

See here: http://collapsar.ohost.de/pics/derek.png

collapsar