Regular Expression to get the SRC of images in C#

Question

I'm looking for a regular expression to isolate the src value of an img. (I know that this is not the best way to do this, but it is what I have to do in this case.)

I have a string that contains simple HTML: some text and an image. I need to get the value of the src attribute from that string. So far I have only managed to isolate the whole tag.

string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;

Obligatory link to [this related answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Cameron, Nov 23 '10 at 15:16

Hinek · Accepted Answer · 2013-12-18T10:21:01.747

53

string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;

edited Dec 18 '13 at 10:21

answered Nov 23 '10 at 15:10

This regex will only work if src is the first attribute of the image. If src comes after id or some other attribute, it will not work. – Unknown Coder Jul 17 '12 at 19:06

@ShreekumarS why? There is a `.+?` between img and src, so there can be all kinds of characters ... – Hinek Jul 18 '12 at 14:41

This one is fine: `Regex.Match(original_text, "", RegexOptions.IgnoreCase).Groups[1].Value;` – Unknown Coder Jul 19 '12 at 06:54

I would make it a little greedier, `.*` instead of `.+?`, certainly for the last one; otherwise you always require at least one character. It might not be there if the img tag is closed right after the src attribute. – Christophe Geers Dec 18 '13 at 07:55

It's not a good idea to make this greedy. What if there is more than one img element? Your expression might capture all of them as one match. But you are right about the end of my expression; I changed it to `.*?` to allow the element to end right after the src attribute. The first `.+?` is still right: there has to be at least one character between img and src, the space ... – Hinek Dec 18 '13 at 10:24
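
To make the lazy versus greedy point from the comments concrete, here is a minimal sketch. The class name `SrcPatternDemo`, the method `SrcValues`, and the greedy variant of the pattern are illustrative assumptions, not part of the answer; the lazy pattern is the one from the accepted answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrcPatternDemo
{
    // Lazy pattern from the accepted answer.
    public const string Lazy = "<img.+?src=[\"'](.+?)[\"'].*?>";

    // Greedy variant (an assumption, written only for comparison).
    public const string Greedy = "<img.+src=[\"'](.+)[\"'].*>";

    // Returns capture group 1 of every match of the given pattern.
    public static List<string> SrcValues(string html, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

For the input `<img src="a.png"><p>text</p><img src='b.png'>`, the lazy pattern yields two values, `a.png` and `b.png`. The greedy variant produces a single match that spans both tags, and its group holds only `b.png`.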

score 15 · Answer 2 · answered Nov 23 '10 at 18:27

I know you say you have to use regex, but if possible I would really give this open source project a chance: HtmlAgilityPack

It is really easy to use. I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath expressions to get your elements.

Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.

The code for your query would look something like this: (uncompiled code)

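 // Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
 List<string> imgSrcs = new List<string>();
 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(htmlText); // or doc.Load(htmlFileStream)
 var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
 if (nodes != null) // SelectNodes returns null when nothing matches
 {
     foreach (var img in nodes)
     {
         // Attributes are read through the Attributes collection
         HtmlAttribute att = img.Attributes["src"];
         imgSrcs.Add(att.Value);
     }
 }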

I tried this, but it looks like the HtmlAgilityPack's api has changed. I have posted an alternative solution to this question — eflles, Apr 06 '12 at 10:07

score 7 · Answer 3 · answered Apr 06 '12 at 10:05

I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:

        List<string> images = new List<string>();
        WebClient client = new WebClient();
        string site = "http://www.mysite.com";
        var htmlText = client.DownloadString(site);

        var htmlDoc = new HtmlDocument()
                    {
                        OptionFixNestedTags = true,
                        OptionAutoCloseOnEnd = true
                    };

        htmlDoc.LoadHtml(htmlText);

        foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
        {
            HtmlAttribute att = img.Attributes["src"];
            images.Add(att.Value);
        }

You should really put //img[@src] in the `SelectNodes` call (or check that the attribute exists before getting `att.Value`). And either check the result for null or tack `?? new HtmlNodeCollection(null);` onto the call to `SelectNodes`. You'll get a `NullReferenceException` otherwise. — jessehouwing, Apr 06 '12 at 14:26
Instead of adding a new answer, you could also edit the original answer to remove the errors contained in there. — jessehouwing, Apr 06 '12 at 14:30
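
A minimal sketch of the two safeguards suggested in the comment above, assuming the HtmlAgilityPack NuGet package is referenced. The class name `ImageSources` and the method `Extract` are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is installed

public static class ImageSources
{
    // Returns the src value of every img element that has a src attribute.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The [@src] predicate skips img elements without a src attribute.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            return result;
        }

        foreach (HtmlNode img in nodes)
        {
            result.Add(img.Attributes["src"].Value);
        }
        return result;
    }
}
```

Calling `ImageSources.Extract(htmlText)` on markup without any images returns an empty list instead of throwing.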

score 3 · Answer 4 · answered Nov 23 '10 at 15:06

This should capture all img tags and just the src part, no matter where it is located (before or after class etc.), and it supports HTML/XHTML :D

<img.+?src="(.+?)".+?/?>

Fabian

score 2 · Answer 5 · answered Nov 23 '10 at 15:06

The regex you want should be along the lines of:

(<img.*?src="([^"]*)".*?>)

Hope this helps.

Niet the Dark Absol

score 1 · Answer 6 · answered Nov 23 '10 at 17:39

You can also use a lookbehind to do it without needing to pull out a group:

(?<=<img.*?src=")[^"]*

Remember to escape the quotes if needed.
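
As a sketch of what escaping the quotes looks like in C#, the lookbehind pattern can be written in a regular string literal like this. The class name `LookbehindSrc` and the method `FirstSrc` are illustrative assumptions. Note that, like the pattern itself, this only handles double-quoted src values.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindSrc
{
    // In a regular C# string literal, each double quote is escaped as \".
    private const string Pattern = "(?<=<img.*?src=\")[^\"]*";

    // Returns the first double-quoted src value, or an empty string if there is no match.
    public static string FirstSrc(string html)
    {
        // The whole match is the src value, so no capture group is needed.
        return Regex.Match(html, Pattern, RegexOptions.IgnoreCase).Value;
    }
}
```

This relies on the .NET regex engine accepting variable-length lookbehind; many other regex flavors would reject the `.*?` inside `(?<=...)`.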

answered Nov 23 '10 at 17:39

Seattle Leonard

6,548
3
27
37

score 0 · Answer 7 · answered Mar 03 '15 at 20:44

This is what I use to get the tags out of strings:

</? *img[^>]*>

TheTC

David Niki · Answer 8 · 2017-07-24T07:56:58.480

-1

Here is the one I use:

<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>

The good part is that it matches any of the below:

<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">

It also copes with less tidy input, such as whitespace around the equals sign and extra attributes, e.g.:

<img src = "test.jpg" width="300">
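
A minimal sketch of using this pattern from C# and reading the named group, assuming a verbatim string literal. The class name `NamedGroupSrc` and the method `FirstSrc` are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NamedGroupSrc
{
    // Verbatim string literal: a double quote is written as "" and backslashes are literal.
    private const string Pattern =
        @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>";

    // Returns the src value of the first img tag, or an empty string if none matches.
    public static string FirstSrc(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        // Both alternatives capture into the same group name, so one lookup covers
        // the quoted and the unquoted case.
        return m.Success ? m.Groups["src"].Value : string.Empty;
    }
}
```

The pattern depends on two .NET behaviors: the same group name may be used more than once, and unnamed groups are numbered before named ones, so `\1` refers to the opening quote character.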

edited Jul 24 '17 at 07:56

answered Jul 24 '17 at 07:49
