Get two different matches from a string

Question

I have an HTML anchor tag like:

<a href="http://www.stackoverflow.com"><h1><b>Stackoverflow</b></h1></a>

I wrote a regex to get the href value which is:

href="(.+)"

Then I wrote a regex to get the link display text:

>(\w+)<

But I can't figure out how to combine them into one regex so that I can extract the href value and the text together.

How can I achieve that?

I tried the following, but it doesn't work, because each match fills only one of the two groups:

href="(.+)".*>|(\w+)<

Is it your first regex question? If yes, I understand. Note that anyone posting a regex solution for this task risks getting downvotes for the sole idea of using regex with HTML. You can easily experiment with that on your own at the [regexstorm.net](http://regexstorm.net/tester) site. You may even learn balancing groups. And avoid `.*`/`.+` in markup text. — Wiktor Stribiżew, Feb 25 '16 at 20:32
[Relevant](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). @EhsanSajjad Learning Regular Expressions is great. However, I strongly recommend you pick something else to learn on! If you're doing this for something simple, such as this one line, then that's fine. Just a word of warning! — Adam Sears, Feb 25 '16 at 21:33
@AdamSears actually following this series: https://www.hackerrank.com/challenges/detect-html-links — Ehsan Sajjad, Feb 25 '16 at 21:47

slugo · Accepted Answer · 2016-02-26T00:40:09.567

1

If you want to use regex this could work for your example:

href="(.*)".*>([^<]+)<
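A minimal sketch of how this pattern behaves, assuming C# and `System.Text.RegularExpressions`. The second input (with a `class` attribute) and the tighter `[^"]+` variant are illustrative assumptions, not part of the question; they show why a greedy `.*` inside the quotes can capture more than the URL.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above.
var pattern = "href=\"(.*)\".*>([^<]+)<";

// The question's example: a single attribute and one-word link text.
var simple = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";
var m1 = Regex.Match(simple, pattern);
Console.WriteLine(m1.Groups[1].Value); // http://www.stackoverflow.com
Console.WriteLine(m1.Groups[2].Value); // Stackoverflow

// Hypothetical input with a second attribute (not from the question).
var extraAttribute = "<a href=\"http://www.stackoverflow.com\" class=\"x\">Stackoverflow</a>";
var m2 = Regex.Match(extraAttribute, pattern);
// The greedy (.*) runs up to the last quote on the line.
Console.WriteLine(m2.Groups[1].Value); // http://www.stackoverflow.com" class="x

// Assumed variant: [^"]+ stops at the first closing quote instead.
var tighter = "href=\"([^\"]+)\".*>([^<]+)<";
var m3 = Regex.Match(extraAttribute, tighter);
Console.WriteLine(m3.Groups[1].Value); // http://www.stackoverflow.com
```

For the single-attribute example in the question both patterns give the same result; the difference only shows up once another quoted attribute follows `href`.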

edited Feb 26 '16 at 00:40

answered Feb 25 '16 at 20:29


sweet and simple, understandable Thanks – Ehsan Sajjad Feb 25 '16 at 21:05
it is failing in this case: `Example Link` – Ehsan Sajjad Feb 25 '16 at 21:34
OK, I modified it; it was only capturing link text that consisted only of alphanumeric characters. – slugo Feb 26 '16 at 00:45

Arturo Menchaca · Answer 2 · 2016-02-25T21:26:44.690

You can use matching groups to capture both the text and the link:

href="(?<link>[^"]+)".*?>(?<text>\w+)<

The basic idea is to combine your two regular expressions into one: link-regex + SOMETEXT + text-regex.

Grouping lets you define subexpressions of a regular expression and capture the substrings of the input that they match.

In this text:

<a href="http://www.stackoverflow.com"><h1><b>Stackoverflow</b></h1></a>

We can match:

href="http://www.stackoverflow.com"><h1><b>Stackoverflow<

Using a regular expression like this: href="[^"]+".*?>\w+<

href="[^"]+" matches the first part (href="http://www.stackoverflow.com").
.*? matches the middle text (><h1><b), taking as few characters as possible.
>\w+< matches the last part (>Stackoverflow<).

We can capture specific parts of the matched string using groups, which are defined with parentheses ():

href="[^"]+" => href="([^"]+)"
>\w+< => >(\w+)<

We can also name groups using (?<name>...):

href="([^"]+)" => href="(?<link>[^"]+)"
>(\w+)< => >(?<text>\w+)<

Finally, we can access the captured groups through the match.Groups property:

using System.Text.RegularExpressions;

var input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";
var pattern = "href=\"(?<link>[^\"]+)\".*?>(?<text>\\w+)<";

// Find the first match in the input.
var match = Regex.Match(input, pattern);

// Named groups are looked up by name; Value is empty if the match failed.
var link = match.Groups["link"].Value;
var text = match.Groups["text"].Value;
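As a hedged follow-up: `\w+` only matches letters, digits and underscores, so link text containing a space will not match. The sketch below compares the pattern above with an assumed variant that uses `[^<]+` for the text group. The input URL and the variable names `wordOnly`, `anyText`, `strict` and `relaxed` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: link text with a space, which \w+ cannot match.
var input = "<a href=\"http://example.com\">Example Link</a>";

// Pattern from this answer: text group limited to word characters.
var wordOnly = "href=\"(?<link>[^\"]+)\".*?>(?<text>\\w+)<";
// Assumed variant: text group accepts anything except '<'.
var anyText = "href=\"(?<link>[^\"]+)\".*?>(?<text>[^<]+)<";

var strict = Regex.Match(input, wordOnly);
var relaxed = Regex.Match(input, anyText);

Console.WriteLine(strict.Success);                // False
Console.WriteLine(relaxed.Groups["link"].Value);  // http://example.com
Console.WriteLine(relaxed.Groups["text"].Value);  // Example Link
```

Checking `Success` before reading the groups avoids silently working with empty strings when nothing matched.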

Can you please also elaborate on how it works? – Ehsan Sajjad Feb 25 '16 at 20:55
specifically usage of `?<link>` and `?<text>` – Ehsan Sajjad Feb 25 '16 at 20:56

Olivier Jacot-Descombes · Answer 3 · 2016-02-25T23:41:09.110

Regex does not work well for parsing HTML or XML. This is because they contain nested structures, may contain additional formatting tags, and may contain escaped characters.

By far the best solution is to use the Html Agility Pack. Compared to just treating the HTML as XML, the Html Agility Pack can cope with unclosed tags (like <br>) and other oddities.
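To illustrate the parser-based idea without an external package, here is a rough sketch that treats the snippet as XML using `System.Xml.Linq` from the standard library. This is a stand-in, not the Html Agility Pack API, and it only works because the example markup happens to be well-formed; real-world HTML with unclosed tags would make `XElement.Parse` throw, which is exactly the limitation mentioned above.

```csharp
using System;
using System.Xml.Linq;

var input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";

// Parse the snippet as XML. This throws XmlException on markup that is
// not well-formed (for example an unclosed <br>).
var anchor = XElement.Parse(input);

// Attribute value, or null if the attribute is missing.
var link = anchor.Attribute("href")?.Value;
// Concatenated text of all descendant nodes, with the tags dropped.
var text = anchor.Value;

Console.WriteLine(link); // http://www.stackoverflow.com
Console.WriteLine(text); // Stackoverflow
```

No regex is involved here: the nested `<h1>` and `<b>` tags are handled by the parser, so the link text comes out the same however deeply it is wrapped.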

If you still want to do it with regex, then I suggest the following pattern:

href="(.+?)"[^/]*>([^<]+)

It yields the HTML address between the quotes as group 1 and the link text without the surrounding tags in group 2.

It looks like a cat walked over my keyboard. I want to try to dissect it and explain the different parts.

The HTML address must follow href=".

We want to find the HTML address with .+?. This means: one or more characters (.+), but as few as possible (?), because otherwise this might swallow too many characters. We enclose this expression in parentheses in order to catch it as a group.

Then comes the unwanted stuff after the HTML address: "[^/]*>, an " followed by zero or more characters except / followed by >. This swallows all the starting tags up to the last >, but not the ending tags, because those contain a /.

We are almost at the end. Now we search the link text with [^<]+ and catch it in a group again. We search for all characters except <, which makes the search stop at the first ending tag.
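The dissection above can be tried out with a short sketch like the following, assuming C# and the example anchor from the question. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";

// href="      literal prefix
// (.+?)       group 1: as few characters as possible up to the next quote
// "[^/]*>     closing quote, then the opening tags up to the last '>' before any '/'
// ([^<]+)     group 2: link text up to the first '<'
var pattern = "href=\"(.+?)\"[^/]*>([^<]+)";

var match = Regex.Match(input, pattern);
var address = match.Groups[1].Value;
var linkText = match.Groups[2].Value;

if (match.Success)
{
    Console.WriteLine(address);  // http://www.stackoverflow.com
    Console.WriteLine(linkText); // Stackoverflow
}
```

Note that the `/` characters inside the URL do not interfere, because they are consumed by group 1 before `[^/]*` starts.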

He said he already knows about it, but he's currently learning and experimenting with regex. — Cédric M., Feb 25 '16 at 20:40
Okay. Regex is a fantastic tool in many situations, but it is very unruly when it comes to parsing XML and HTML. — Olivier Jacot-Descombes, Feb 25 '16 at 20:44

Quinn · Answer 4 · 2016-02-25T21:36:43.933

Another approach:

string input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";
string pattern = "href=\"([^\"]+)\".*>([^<]+)<";
// One inner list per match: [href value, link text]
var result = Regex.Matches(input, pattern).Cast<Match>().ToList().ConvertAll(m => new List<string>() {m.Groups[1].Value, m.Groups[2].Value});

The result is a list of lists:

[{"http://www.stackoverflow.com", "Stackoverflow"}]

Regex explained:

href=\"     match href="
([^\"]+)    match everything other than " (i.e. http://www.stackoverflow.com)
\"          match "
.*>         match everything up to the last > that still lets the rest match (greedy)
([^<]+)     match everything other than < (i.e. Stackoverflow)
<           match <
