Hi everyone. I'm quite new to regular expressions, and I'm trying to extract the src values from img tags in HTML pages, so I wrote this regular expression: @"<img.*src *=*([\x22\x27])(?<path>.+)(\1).*/>"
But when I try to get the value from the group "path" with this sample tag:
<img src='kkkkkk' class='icon' alt='' />
I get kkkkkk' class='icon' alt=' instead of just kkkkkk, and I can't figure out why.
Here is the code I'm using to extract and print the data:

Regex SrcRegex = new Regex(@"<img.*src *=*([\x22\x27])(?<path>.+)(\1).*/>", RegexOptions.IgnoreCase);

string TestTag = "<img src='kkkkkk' class='icon' alt='' />";

MatchCollection MatchedString = SrcRegex.Matches(TestTag);

foreach (Match M in MatchedString)
        Console.WriteLine(M.Groups["path"].Value);

Thanks for your attention, and please excuse my English.

aevitas
    One of the most popular answers on SO: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – I4V Jul 29 '13 at 20:36
    If you want to parse HTML, you are better off going with the HtmlAgilityPack – Keith Nicholas Jul 29 '13 at 20:37

2 Answers


When dealing with HTML, it is better to use an HTML parser than a regex. For example, using HtmlAgilityPack:

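// Requires: using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);

// GetAttributeValue returns the default (null here) when an img has no src,
// so tags without a src attribute are skipped instead of throwing.
var imgUrls = doc.DocumentNode.Descendants("img")
                .Select(img => img.GetAttributeValue("src", null))
                .Where(src => src != null)
                .ToList();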
I4V

To answer in regex terms, the problem is simply that you're using a greedy quantifier in (?<path>.+), so it matches to the last quote, not the next one, as you intend. Just make it non-greedy:

Regex SrcRegex = new Regex(@"<img.*src *= *([\x22\x27])(?<path>.+?)(\1).*/>", RegexOptions.IgnoreCase);

BTW, I added a space after the =, because I take it that's what you intended. You want to require the =, and optionally match spaces after it, right? The way you had it would match zero or more = signs, with no spaces allowed between the = and the opening quote.
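
To see the difference between the two quantifiers, here is a minimal sketch that runs both patterns against the sample tag from the question. The class name GreedyVsLazyDemo and the variable names are made up for illustration; the patterns are the greedy and non-greedy versions discussed above, both with the space after the = added.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyVsLazyDemo
{
    static void Main()
    {
        string testTag = "<img src='kkkkkk' class='icon' alt='' />";

        // Greedy .+ takes as much as it can, then backs off only as far as
        // the LAST quote that still lets the rest of the pattern match.
        var greedy = new Regex(
            @"<img.*src *= *([\x22\x27])(?<path>.+)(\1).*/>",
            RegexOptions.IgnoreCase);

        // Lazy .+? takes as little as it can, so it stops at the NEXT quote.
        var lazy = new Regex(
            @"<img.*src *= *([\x22\x27])(?<path>.+?)(\1).*/>",
            RegexOptions.IgnoreCase);

        // prints: kkkkkk' class='icon' alt='
        Console.WriteLine(greedy.Match(testTag).Groups["path"].Value);

        // prints: kkkkkk
        Console.WriteLine(lazy.Match(testTag).Groups["path"].Value);
    }
}
```

Note that the leading <img.* is still greedy in both patterns, so on input containing several tags on one line this sketch is only meant to show the effect of the quantifier inside the path group.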

Adi Inbar