I have an HTML anchor tag like this:

<a href="http://www.stackoverflow.com"><h1><b>Stackoverflow</b></h1></a>

I wrote a regex to get the href value:

href="(.+)"

Then I wrote a regex to get the link's display text:

>(\w+)<

But I can't figure out how to combine the two into one regex so that I can extract the href value and the text together.

How can I achieve that?

I have tried the following, but it doesn't work, because the alternation only ever fills one of the two groups per match:

href="(.+)".*>|(\w+)<
Ehsan Sajjad
  • Why are you using RegEx? Take a look at HtmlAgilityPack. – Tim Feb 25 '16 at 20:22
  • @Tim i want to do it using Regex, learning regex nowadays – Ehsan Sajjad Feb 25 '16 at 20:22
  • try using matching groups – Jacobr365 Feb 25 '16 at 20:24
  • href="([^"]*?)">

    ([^<>]*?)

    – Gusman Feb 25 '16 at 20:28
  • Is it your first regex question? If yes, I understand. Note that anyone posting a regex solution for this task risks getting downvotes for the sole idea of using regex with HTML. You can easily experiment with that on your own at the [regexstorm.net](http://regexstorm.net/tester) site. You may even learn balancing groups. And avoid `.*`/`.+` in markup text. – Wiktor Stribiżew Feb 25 '16 at 20:32
  • [Relevant](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). @EhsanSajjad Learning Regular Expressions is great. However, I strongly recommend you pick something else to learn on! If you're doing this for something simple, such as this one line, then that's fine. Just a word of warning! – Adam Sears Feb 25 '16 at 21:33
  • @AdamSears actually following this series: https://www.hackerrank.com/challenges/detect-html-links – Ehsan Sajjad Feb 25 '16 at 21:47

4 Answers


If you want to use regex, this could work for your example:

href="(.*)".*>([^<]+)<
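
One caveat, shown as a minimal sketch: both `.*` parts are greedy, so the pattern holds for a single link per input but runs across tag boundaries once a second link appears on the same line. The two-link input and the variable names below are made up for illustration, and the tightened pattern (a swapped-in `[^"]+` plus a lazy `.*?`) is only one possible adjustment.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input with two links on one line (not from the question).
var twoLinks = "<a href=\"http://a.com\">A</a> <a href=\"http://b.com\">B</a>";

// Greedy pattern from above: (.*) runs up to the last usable quote in the input.
var greedy = Regex.Match(twoLinks, "href=\"(.*)\".*>([^<]+)<");
Console.WriteLine(greedy.Groups[1].Value); // spans both tags, not just the first URL

// Tightened variant: [^"]+ cannot cross the closing quote,
// and the lazy .*? stops at the first '>' that is followed by text.
var pairs = Regex.Matches(twoLinks, "href=\"([^\"]+)\".*?>([^<]+)<")
    .Cast<Match>()
    .Select(m => (Href: m.Groups[1].Value, Text: m.Groups[2].Value))
    .ToList();

foreach (var p in pairs)
    Console.WriteLine($"{p.Href} -> {p.Text}");
```

With the tightened variant, each link yields its own match, which is why `Regex.Matches` is used instead of `Regex.Match`.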

slugo

You can use matching groups to capture both the text and the link:

href="(?<link>[^"]+)".*?>(?<text>\w+)<

The basic idea is to combine your regular expressions into one: link-regex + SOMETEXT + text-regex.

Grouping lets you define subexpressions of a regular expression and capture the substrings of the input string that they match.

In this text:

<a href="http://www.stackoverflow.com"><h1><b>Stackoverflow</b></h1></a>

We can match:

href="http://www.stackoverflow.com"><h1><b>Stackoverflow<

Using a regular expression like this: href="[^"]+".*?>\w+<

  • href="[^"]+" matches the first part (href="http://www.stackoverflow.com").
  • .*? matches the text in the middle (><h1><b).
  • >\w+< matches the last part (>Stackoverflow<).

We can capture specific parts of the matched string using groups, which are defined using parentheses ():

  • href="[^"]+" => href="([^"]+)"
  • >\w+< => >(\w+)<

We can also name groups using (?<name>...):

  • href="([^"]+)" => href="(?<link>[^"]+)"
  • >(\w+)< => >(?<text>\w+)<

Finally, we can access the captured groups through the match.Groups property:

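using System.Text.RegularExpressions;

var input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";
// Named groups: "link" holds the href value, "text" holds the display text.
var pattern = "href=\"(?<link>[^\"]+)\".*?>(?<text>\\w+)<";

var match = Regex.Match(input, pattern);

// Groups are read by name; both values are empty strings if nothing matched.
var link = match.Groups["link"].Value;
var text = match.Groups["text"].Value;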
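
A point the walkthrough implies, sketched under assumptions: `\w+` has to be followed directly by `<`, so link text containing a space or punctuation produces no match at all. The "Stack Overflow" input below is a hypothetical variation of the example, and replacing `\w+` with `[^<]+` is one possible workaround, not part of the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variation of the example: the link text now contains a space.
var spaced = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stack Overflow</b></h1></a>";

// \w+ must run right up to '<', so "Stack Overflow" cannot match.
var strict = Regex.Match(spaced, "href=\"(?<link>[^\"]+)\".*?>(?<text>\\w+)<");
Console.WriteLine(strict.Success);

// [^<]+ accepts any run of characters up to the next tag.
var relaxed = Regex.Match(spaced, "href=\"(?<link>[^\"]+)\".*?>(?<text>[^<]+)<");
if (relaxed.Success)
{
    Console.WriteLine(relaxed.Groups["link"].Value);
    Console.WriteLine(relaxed.Groups["text"].Value);
}
```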
Arturo Menchaca

Regex does not work well for parsing HTML or XML. This is because they contain nested structures, may contain additional formatting tags, and may also contain escaped characters.

By far the best solution is to use the Html Agility Pack. Compared to just treating the HTML as XML, the Html Agility Pack can cope with unclosed tags (like <br>) and other oddities.
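
As a rough sketch of the parser-based route, assuming the HtmlAgilityPack NuGet package is installed and referenced (the variable names and the tuple list are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

var html = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var links = new List<(string Href, string Text)>();

// XPath: every <a> element that carries an href attribute.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null rather than an empty collection when nothing matches.
if (anchors != null)
{
    foreach (var a in anchors)
    {
        links.Add((a.GetAttributeValue("href", ""), a.InnerText));
    }
}

foreach (var link in links)
    Console.WriteLine($"{link.Href} -> {link.Text}");
```

`InnerText` returns the concatenated text of all descendant nodes, so the nested `<h1>` and `<b>` tags need no special handling.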


If you still want to do it with regex, then I suggest the following pattern:

href="(.+?)"[^/]*>([^<]+)

It yields the HTML address between the quotes as group 1 and the link text without the surrounding tags in group 2.

It looks like a cat walked over my keyboard, so let me dissect it and explain the different parts.

The HTML address must follow href=".

We want to find the HTML address with .+?. This means: one or more characters (.+), but as few as possible (?), because otherwise this might swallow too many characters. We enclose this expression in parentheses in order to catch it as a group.

Then comes the unwanted stuff after the HTML address: "[^/]*>, an " followed by zero or more characters except / followed by >. This swallows all the starting tags up to the last >, but not the ending tags, because those contain a /.

We are almost at the end. Now we search the link text with [^<]+ and catch it in a group again. We search for all characters except <, which makes the search stop at the first ending tag.
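
Applied to the anchor tag from the question, the dissected pattern can be tried out with a short sketch like this (the variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

var html = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";

// Group 1: the address between the quotes.
// Group 2: the text that precedes the first closing tag.
var m = Regex.Match(html, "href=\"(.+?)\"[^/]*>([^<]+)");

if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value); // the address
    Console.WriteLine(m.Groups[2].Value); // the link text
}
```

Because `[^/]*` refuses any slash, this relies on the opening tags after the href containing no `/` (for example in a later attribute value).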

Olivier Jacot-Descombes

Another approach:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
string input = "<a href=\"http://www.stackoverflow.com\"><h1><b>Stackoverflow</b></h1></a>";
string pattern = "href=\"([^\"]+)\".*>([^<]+)<";
var result = Regex.Matches(input, pattern).Cast<Match>().ToList().ConvertAll(m => new List<string>() {m.Groups[1].Value, m.Groups[2].Value});

The result is a list of lists, one inner list per match:

[{"http://www.stackoverflow.com", "Stackoverflow"}]

Regex explained:

href=\"     match href="
([^\"]+)    match everything other than " (i.e. http://www.stackoverflow.com)
\"          match "
.*>         match everything up to a > (greedy: the last > that still lets the rest match)
([^<]+)     match everything other than < (i.e. Stackoverflow)
<           match <
Quinn