29

I'm looking for a regular expression to isolate the src value of an img. (I know this is not the best way to do it, but it is what I have to do in this case.)

I have a string that contains simple HTML: some text and an image. I need to get the value of the src attribute from that string. So far I have only managed to isolate the whole tag.

string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;
zekia
  • Run a second regex on the img tag to get the src attribute – simendsjo Nov 23 '10 at 15:06
  • 3
    Obligatory link to [this related answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Cameron Nov 23 '10 at 15:16

8 Answers

53
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
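
If the string may contain more than one image, the same pattern can be passed to `Regex.Matches` instead of `Regex.Match`. The following is only a minimal sketch under that assumption; the class name `ImgSrcExtractor`, the method name `ExtractAll`, and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImgSrcExtractor
{
    // Same pattern as in the answer above. The lazy quantifiers stop at the
    // first src= after "<img" and at the first closing quote after the value.
    private static readonly Regex ImgSrc = new Regex(
        "<img.+?src=[\"'](.+?)[\"'].*?>",
        RegexOptions.IgnoreCase);

    // Returns the src value of every match, in the order they appear.
    public static List<string> ExtractAll(string html)
    {
        var result = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical input with src in different positions and quote styles.
        string html = "<p>Hi <img id=\"a\" src=\"one.png\"> and <IMG src='two.jpg' alt=\"x\"/></p>";
        foreach (string src in ExtractAll(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Because the quantifiers are lazy rather than tag-aware, an img tag without a src attribute can still make the match run on into a later tag, so this remains a best-effort approach.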
Hinek
  • 1
    This regex will only work if src is the first attribute of the image. If src comes after id or some other attribute, it will not work – Unknown Coder Jul 17 '12 at 19:06
  • 2
    @ShreekumarS why? There is a `.+?` between img and src, so there can be all kinds of characters ... – Hinek Jul 18 '12 at 14:41
  • 3
    This one is Fine `Regex.Match(original_text, "", RegexOptions.IgnoreCase).Groups[1].Value;` – Unknown Coder Jul 19 '12 at 06:54
  • I would make it a little greedier, .* instead of .+?, certainly for the last one; otherwise you always require at least one character. It might not be there if they just close the img tag right after the src attribute. – Christophe Geers Dec 18 '13 at 07:55
  • It's not a good idea to make this greedy: what if there is more than one img element? Your expression might capture all of these elements as one match. But you are right about the end of my expression; I changed it to .*? to allow the element to end right after the src attribute. The first .+? is still right, because there has to be at least one character between img and src: the space ... – Hinek Dec 18 '13 at 10:24
15

I know you say you have to use regex, but if possible I would really give this open source project a chance: HtmlAgilityPack

It is really easy to use. I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath expressions to get your elements.

Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.

The code for your query would look something like this (uncompiled code):

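 // Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
 List<string> imgSrcs = new List<string>();
 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(htmlText); // or doc.Load(htmlFileStream)
 // SelectNodes returns null when nothing matches the XPath
 var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
 if (nodes != null)
 {
     foreach (var img in nodes)
     {
         HtmlAttribute att = img.Attributes["src"];
         imgSrcs.Add(att.Value);
     }
 }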
Francisco Noriega
  • I tried this, but it looks like the HtmlAgilityPack's api has changed. I have posted an alternative solution to this question – eflles Apr 06 '12 at 10:07
7

I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:

        List<string> images = new List<string>();
        WebClient client = new WebClient();
        string site = "http://www.mysite.com";
        var htmlText = client.DownloadString(site);

        var htmlDoc = new HtmlDocument()
                    {
                        OptionFixNestedTags = true,
                        OptionAutoCloseOnEnd = true
                    };

        htmlDoc.LoadHtml(htmlText);

        foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
        {
            HtmlAttribute att = img.Attributes["src"];
            images.Add(att.Value);
        }
eflles
  • 2
    You should really put `//img[@src]` in the `SelectNodes` call (or check that the attribute exists before reading `att.Value`). And either check the result for null or tack `?? new HtmlNodeCollection(null)` onto the `SelectNodes` call. You'll get a `NullReferenceException` otherwise. – jessehouwing Apr 06 '12 at 14:26
  • 1
    Instead of adding a new answer, you could also edit the original answer to remove the errors contained in there. – jessehouwing Apr 06 '12 at 14:30
3

This should capture all img tags and just the src part, no matter where it's located (before or after class etc.), and it supports both HTML and XHTML :D

<img.+?src="(.+?)".+?/?>
Fabian
2

The regex you want should be along the lines of:

(<img.*?src="([^"]*)".*?>)

Hope this helps.

Niet the Dark Absol
1

You can also use a lookbehind to do it without needing to pull out a group:

(?<=<img.*?src=")[^"]*

Remember to escape the quotes if needed.
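
As a rough sketch of how the lookbehind version might be used from C#: the whole match is the src value, so `Match.Value` can be read directly. The class name `LookbehindDemo`, the method name `FirstSrc`, and the sample tag are assumptions for illustration. This relies on .NET allowing variable-length lookbehind, which many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Verbatim string: a double quote inside @"..." is written as "".
    private static readonly Regex SrcValue = new Regex(
        @"(?<=<img.*?src="")[^""]*",
        RegexOptions.IgnoreCase);

    // Returns the first src value found, or an empty string if there is none.
    public static string FirstSrc(string html)
    {
        Match m = SrcValue.Match(html);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        // Hypothetical input; src is not the first attribute.
        Console.WriteLine(FirstSrc("<img class=\"c\" src=\"pic.gif\" />"));
    }
}
```

Note that, as written, the pattern only handles double-quoted attribute values.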

Seattle Leonard
0

This is what I use to get the tags out of strings:

</? *img[^>]*>
TheTC
-1

Here is the one I use:

<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>

The good part is that it matches any of the below:

<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">

And it can also match some unexpected scenarios, like extra whitespace or extra attributes, e.g.:

<img src = "test.jpg" width="300">
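
To read the value out in C#, the named group `src` can be accessed through `Groups["src"]`. This is just a sketch using the pattern above; the class name `NamedGroupDemo` and the method name `GetSrc` are invented for the example. It depends on .NET numbering unnamed groups before named ones, so `\1` refers to the quote character, and on .NET allowing the same group name in both alternatives.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Pattern from the answer above, written as a verbatim string
    // (a double quote inside @"..." is written as "").
    private static readonly Regex ImgSrc = new Regex(
        @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>",
        RegexOptions.IgnoreCase);

    // Returns the src value of the first img tag, or an empty string.
    public static string GetSrc(string html)
    {
        Match m = ImgSrc.Match(html);
        return m.Success ? m.Groups["src"].Value : string.Empty;
    }

    static void Main()
    {
        // The four sample tags listed in the answer.
        string[] samples =
        {
            "<img src='test.jpg'>",
            "<img src=test.jpg>",
            "<img src=\"test.jpg\">",
            "<img src = \"test.jpg\" width=\"300\">"
        };
        foreach (string tag in samples)
        {
            Console.WriteLine(GetSrc(tag));
        }
    }
}
```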
David Niki