Regex does not work well for parsing HTML or XML, because they contain nested structures and may also contain additional formatting tags and escaped characters.
By far the best solution is to use the Html Agility Pack. Compared to just treating the HTML as XML, the Html Agility Pack can cope with unclosed tags (like <br>) and other oddities.
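To illustrate the parser approach, here is a minimal sketch. It does not use the Html Agility Pack (a .NET library); as an assumption for illustration it uses Python's standard html.parser module as a stand-in, and the LinkCollector class and the sample HTML are hypothetical. It shows that a lenient parser is not bothered by the unclosed <br> or by nested formatting tags inside the link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> that is currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


# Sample input with an unclosed <br> and a nested <b> inside the link.
html = '<p>First line<br><a href="http://example.com/page"><b>Example</b> site</a></p>'

collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)  # [('http://example.com/page', 'Example site')]
```

Unlike the regex below, this keeps the complete link text even though it is interrupted by a closing </b> tag.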
If you still want to do it with regex, then I suggest the following pattern:
href="(.+?)"[^/]*>([^<]+)
It yields the HTML address between the quotes as group 1 and the link text without the surrounding tags in group 2.
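As a quick sketch of how the two groups come out, here is the pattern applied with Python's re module. The choice of Python, the name LINK_PATTERN and the sample HTML are assumptions made for illustration; the pattern itself is the one shown above and uses only syntax that common regex engines share.

```python
import re

# The pattern from above: group 1 is the address, group 2 is the link text.
LINK_PATTERN = re.compile(r'href="(.+?)"[^/]*>([^<]+)')

# Hypothetical sample input with two simple links.
html = (
    '<ul><li><a href="http://example.com/one">One</a></li>'
    '<li><a href="http://example.com/two">Two</a></li></ul>'
)

# With two capturing groups, findall returns a list of (group 1, group 2) tuples.
links = LINK_PATTERN.findall(html)
for address, text in links:
    print(address, "->", text)
# http://example.com/one -> One
# http://example.com/two -> Two
```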
It looks like a cat walked over my keyboard, so let me dissect it and explain the different parts.
The HTML address must follow href=".
We want to find the HTML address with .+?. This means: one or more characters (.+), but as few as possible (?), because otherwise this might swallow too many characters. We enclose this expression in parentheses in order to capture it as a group.
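The difference between the lazy .+? and a greedy .+ is easiest to see side by side. The following sketch uses Python's re module and a made-up tag with a second quoted attribute; both are assumptions chosen for illustration.

```python
import re

# Hypothetical tag with a second quoted attribute after href.
html = '<a href="http://example.com/a" title="A">first</a>'

# Lazy: stops at the first closing quote after href=".
lazy = re.search(r'href="(.+?)"', html).group(1)

# Greedy: runs on to the last quote in the string and swallows the title attribute.
greedy = re.search(r'href="(.+)"', html).group(1)

print(lazy)    # http://example.com/a
print(greedy)  # http://example.com/a" title="A
```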
Then comes the unwanted stuff after the HTML address: "[^/]*>, that is, a " followed by zero or more characters except / followed by >. This swallows all the starting tags up to the last >, but not the ending tags, because those contain a /.
We are almost at the end. Now we search the link text with [^<]+ and capture it in a group again. We search for all characters except <, which makes the search stop at the first ending tag.
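The last two steps can be checked on a link whose text is wrapped in nested tags. This is a sketch in Python with a hypothetical input string: the starting tags after the address are skipped, and the captured text ends at the first ending tag, so anything after that tag is not part of group 2.

```python
import re

pattern = re.compile(r'href="(.+?)"[^/]*>([^<]+)')

# Hypothetical input: an extra attribute, two nested starting tags,
# and more text after the first ending tag.
html = '<a href="http://example.com/page" class="ext"><span><b>Bold</b> rest</span></a>'

match = pattern.search(html)
address = match.group(1)
text = match.group(2)

print(address)  # http://example.com/page
print(text)     # Bold
```

Note that " rest" is lost here, which is one of the reasons a real HTML parser is the more robust choice.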
([^<>]*?) – Gusman Feb 25 '16 at 20:28