5

I am trying to extract href and src links from an HTML string. According to this post, I was able to get the image portion. Can anyone help adjust the regular expression to include the href URL in the collection too?

public List<string> GetLinksFromHtml(string content)
{
    string regex = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    var matches = Regex.Matches(content, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    var links = new List<string>();

    foreach (Match item in matches)
    {
        string link = item.Groups[1].Value;
        links.Add(link);
    }

    return links;
}
TruMan1
  • Why don't you just use a regular [HTML parser](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) for this? – Roman Nov 09 '11 at 14:15
  • 1
    I do not want to have to deploy an extra library. It is supposed to be a light and quick method I can easily drop into any project. – TruMan1 Nov 09 '11 at 14:17
  • 7
    You should think about using a library. For a one-time hack Regex and HTML might work, but if you will use this more frequently, you will some day try to parse html with evil comments and embedded javascript and everything will explode. – Jens Nov 09 '11 at 14:23
  • 1
    A regular expression is a bad choice for wild HTML. However, if you can 100% guarantee that the tags you'll be parsing are your own and are completely valid image tags you may be able to get away with it. Regardless, HtmlAgilityPack is the perfect solution. – Mike B Nov 09 '11 at 14:35

5 Answers

11

Okie Doke! Without "an extra library", and "quick and light", here ya go:

<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:"(?<URL>(?:\\"|[^"])*)"|'(?<URL>(?:\\'|[^'])*)')

or as a C# string:

@"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')"

This captures the tag name (a or img) into the group "Tag_Name", the URL type (href or src) into the group "URL_Type", and the URL into the group "URL" (I know, I got a bit creative with the group names).

It handles either type of quote (" or '), and even though any quotes in a URL should already be encoded as entities, it will skip over backslash-escaped quote characters (\' and \") rather than treating them as the end of the URL.

It does not ignore unclosed tags (that is, malformed HTML). It finds the opening of one of the tags (<a or <img), skips everything except a greater-than sign (>) until it reaches the matching URL attribute (href for a tags, src for img tags), and then matches the contents. It then quits and does not worry about the rest of the tag!

Let me know if you'd like me to break it down for you, but here is a sampling of the matches it made for this very page:

<Match>                                  'Tag' 'URL_Type' 'URL'
---------------------------------------- ----- ---------- -----------------------------
<a href="http://meta.stackoverflow.com"   a     href      http://meta.stackoverflow.com
<a href="/about"                          a     href      /about
<a href="/faq"                            a     href      /faq
<a href="/"                               a     href      /
<a id="nav-questions" href="/questions"   a     href      /questions
...
<img src="/posts/8066248/ivc/d499"        img   src       /posts/8066248/ivc/d499

It found a total of 140 tags (I am assuming additional posters will increase this somewhat)
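
As a rough illustration, here is a minimal sketch of how the expression above might be dropped into the question's GetLinksFromHtml method. The LinkExtractor class name and the use of RegexOptions.IgnoreCase are assumptions made for this example, not part of the original answer. Note that the conditional (?(1)href|src) relies on (a) being the only unnamed group, so RegexOptions.ExplicitCapture should not be used with it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for illustration.
public static class LinkExtractor
{
    // The pattern from the answer, as a C# verbatim string.
    // Group 1 is the unnamed (a); the conditional picks href when it matched, src otherwise.
    private static readonly Regex LinkPattern = new Regex(
        @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')",
        RegexOptions.IgnoreCase);

    // Returns the href of every <a> tag and the src of every <img> tag, in document order.
    public static List<string> GetLinksFromHtml(string content)
    {
        var links = new List<string>();

        foreach (Match match in LinkPattern.Matches(content))
        {
            // Both quote alternatives capture into the same named group "URL".
            links.Add(match.Groups["URL"].Value);
        }

        return links;
    }
}
```

Calling LinkExtractor.GetLinksFromHtml on a fragment containing one anchor and one image should yield both URLs; match.Groups["Tag_Name"] and match.Groups["URL_Type"] are available in the loop if the tag or attribute name is needed as well.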

Code Jockey
0

The code below can help you get every link in the HTML. Once you have the matches, you can pull more detail out of each link, such as its URL and text:

string html = "123<a href=\"http://www.codeios.com/home.php\">123123</a>789";
Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

foreach (Match match in r.Matches(html))
{
    string url = match.Groups["href"].Value;
    string text = match.Groups["value"].Value;

    Response.Write(url + text);
}
Wilson Wu
0

I just sketched up this regular expression real quick, but it is tested and working. Tell me if this suits your needs. (url and img are named groups, so they'll be easy to retrieve.)

<a(.*?)href="(?P<url>.*?)"(.*?)><img(.*)src="(?P<img>.*?)"(.*?)></a>

You could also make it catch images without link by adding the ? sign for the <a> and </a> tags, as follows:

(<a(.*?)href="(?P<url>.*?)"(.*?)>)?(<img(.*)src="(?P<img>.*?)"(.*?)>)(</a>)?
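
For .NET specifically, the named groups need the (?<name>...) form rather than (?P<name>...). The following is a hedged sketch of how the second pattern might be adapted for C#, also accepting single or double quotes. It is an adaptation rather than the exact expression above: the LinkedImageExtractor class, the q1/q2 helper groups, and the switch from (.*?) to [^>]* are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for illustration.
public static class LinkedImageExtractor
{
    // Optional <a ... href=...> wrapper, then a required <img ... src=...>, then an optional </a>.
    // q1 and q2 remember which quote opened each value so the same quote must close it.
    private static readonly Regex Pattern = new Regex(
        @"(?:<a\b[^>]*?\bhref\s*=\s*(?<q1>[""'])(?<url>.*?)\k<q1>[^>]*>)?" +
        @"<img\b[^>]*?\bsrc\s*=\s*(?<q2>[""'])(?<img>.*?)\k<q2>[^>]*>(?:</a>)?",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (Url, Img) pair per image; Url is empty when the image is not wrapped in a link.
    public static List<(string Url, string Img)> Extract(string html)
    {
        var results = new List<(string Url, string Img)>();

        foreach (Match match in Pattern.Matches(html))
        {
            results.Add((match.Groups["url"].Value, match.Groups["img"].Value));
        }

        return results;
    }
}
```

An image that is not directly preceded by its opening a tag is still matched, with an empty Url, which mirrors the optional-anchor idea in the second expression.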

Shai

Shai Mishali
  • That won't quite work, HTML allows for both single and double quotes for attributes. – Roman Nov 09 '11 at 14:30
  • It's no trouble to allow either with (\'|\") :) I'm a huge discourager of regex for HTML parsing, but he asked for a specific solution, which I tried to provide. – Shai Mishali Nov 09 '11 at 14:37
  • This is for .Net, I believe - as far as I know, .Net does not support the `(?P...)` group naming construct - have they changed this? or have I always been wrong? – Code Jockey Nov 09 '11 at 15:56
  • I'm not from the .NET world :) I just gave my shot of Regular Expressions. He can just change the. I saw someone here using the (?<>) syntax, so in that case, just removing the P would do the job. – Shai Mishali Nov 09 '11 at 16:02
  • 2
    @CodeJockey: .NET regexes [support](http://stackoverflow.com/questions/906493/regex-named-capturing-groups-in-net) named capture groups. Syntax is also nearly correct, just needs to be without the `P`. – Roman Nov 09 '11 at 18:51
  • @R0MANARMY I think this might be a matter of bad phrasing of my comment above. Yes - [.Net does support them](http://msdn.microsoft.com/en-us/library/30wbz966(v=vs.71).aspx) - just not the `(?P...)` style (specifically - as you noted - the `P` part of the construct) Instead, it recognizes the form `(?<xxx>...)` (as used in my answer) and also the `(?'xxx'...)` which in my opinion is atypical, non-"standard", and more confusing because it's harder to read in a complex expression. Thus, it should be burned in effigy (sigh... if only that were possible). Nonetheless, it recognizes it! – Code Jockey Nov 09 '11 at 20:34
0

So monstrous! That's because parsing HTML with regular expressions is evil:

 <img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?href\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>
Vitaly Slobodin
  • The only threat of evil in using regex to parse SGML is if you're trying to parse SGML nested within other SGML - if there's no need to acknowledge nesting (like with say, image and anchor tags in HTML???) then there's no problem! – Code Jockey Nov 09 '11 at 16:00
-1

There are several places in which links and images can be found.

-Link
    -href
        (?<AttributeName>(?:href))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
        for c# = (?<AttributeName>(?:href))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

-Image
    -Image_DirectSource
        -src
        -background
            (?<AttributeName>(?:src|background))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
            for c# = (?<AttributeName>(?:src|background))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

    -Image_IndirectSource
        -style
            -background:url()
            background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)
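
The patterns in this outline could be applied together along these lines. This is only a sketch under stated assumptions: the AttributeUrlExtractor class name is made up, href, src and background are merged into a single alternation for brevity, and the trimming of quotes around url(...) values is an addition not present in the patterns above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for illustration.
public static class AttributeUrlExtractor
{
    // href, src and background attributes, merged into one alternation (an assumption).
    private static readonly Regex AttributePattern = new Regex(
        @"(?<AttributeName>(?:href|src|background))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)",
        RegexOptions.IgnoreCase);

    // Indirect image source inside a style declaration: background: url(...)
    private static readonly Regex CssUrlPattern = new Regex(
        @"background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)",
        RegexOptions.IgnoreCase);

    // Returns attribute values first, then CSS url(...) values.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();

        foreach (Match match in AttributePattern.Matches(html))
        {
            values.Add(match.Groups["AttributeValue"].Value);
        }

        foreach (Match match in CssUrlPattern.Matches(html))
        {
            // url('...') may carry its own quotes or padding; trimming them is an added step.
            values.Add(match.Groups["AttributeValue"].Value.Trim('\'', '"', ' '));
        }

        return values;
    }
}
```

Because these patterns do not anchor on a tag name or a word boundary, they will also pick up attributes such as data-src, so the results may need filtering depending on the input.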

Frank Myat Thu