
I am using the following regex to get the src value of the first img tag in an HTML document.

string match = "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?"

This captures the entire src attribute, which I don't need. I just need the URL inside the src attribute. How can I do that?

GEOCHET
Tanmoy

3 Answers


Parse your HTML with something else. HTML is not regular and thus regular expressions aren't at all suited to parsing it.

Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:

//img/@src

XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.
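As a rough illustration of the parser route (a sketch, not the answerer's code, written in Python rather than C# and using only the standard library's html.parser), the class below remembers the src of the first img tag that has one. The names FirstImgSrc and first_img_src are made up for this example.

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Remember the src value of the first <img> tag that carries one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs with the surrounding quotes already removed.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(html):
    """Return the first img src found in the markup, or None."""
    parser = FirstImgSrc()
    parser.feed(html)
    parser.close()
    return parser.src


print(first_img_src('<p>hi</p><img alt="x" src="/a/b.png"><img src="c.jpg">'))
```

The parser handles the quoting itself, so single-quoted, double-quoted and unquoted values need no special cases.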

Welbog
  • he's not looking to parse html, rather to simply extract a value from a single type of tag in html. Regexes excel at this sort of thing. – Edward Q. Bridges Jun 29 '09 at 15:23
  • @eqbridges: The fact that the regex he's come up with is so complicated is an indication that it's the wrong way of going about the problem. Then there's the fact that it doesn't match all possible values for the src attributes (i.e. ones containing ' or "). Don't parse HTML/XML this way! Just don't do it! – Welbog Jun 29 '09 at 15:25
  • @Welbog -- if he only needs to get out a value of the img src, I respectfully disagree. Wielding an HTML parser on a task like that is overkill. If he needs to do anything particularly complex, then I'd be more likely to agree. – Edward Q. Bridges Jun 29 '09 at 15:27
  • @eqbridges: You call it overkill, I call it simplicity. "//img/@src" is much simpler, readable and maintainable than "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?", and above all it's actually correct. – Welbog Jun 29 '09 at 15:33
  • Does anyone have a good example of a tool that's not specific to C#? – Jeff Davis Jun 29 '09 at 16:16
  • @Jeff Davis: XPath, XQuery and XSL are all XML-related and not tied to any other programming languages. – Welbog Jun 29 '09 at 16:23

See the questions "When not to use Regex in C# (or Java, C++ etc)" and "Looking for C# HTML parser".

PS, how can I put a link to a StackOverflow question in a comment?

Community
Ian Ringrose

In plain English, your regex should match the opening quote of the src attribute and then capture every following character that is not a quote.

In Perl regex syntax, it would look like this:

/src=[\"\']([^\"\']+)/

The URL will be in $1 after running this.

Of course, this assumes that the URLs in your src attributes are quoted. If they are not, modify the character classes in the [] brackets accordingly.
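For comparison, here is a minimal sketch of the same pattern in Python instead of Perl. It shows the difference between the whole match and the capture group. SRC_RE and first_src are names invented for this example, and the sample markup is made up.

```python
import re

# Same idea as the Perl pattern: match src= plus an opening quote, then
# capture everything up to (not including) the next quote character.
SRC_RE = re.compile(r"""src=["']([^"']+)""")


def first_src(html):
    """Return the first quoted src value in the text, or None."""
    match = SRC_RE.search(html)
    if match is None:
        return None
    return match.group(1)  # group 1 is the URL only


html = '<p><img class="logo" src="/images/logo.png" alt="Logo"></p>'
match = SRC_RE.search(html)
print(match.group(0))  # whole match, including the leading src="
print(match.group(1))  # capture group only: the URL
print(first_src(html))
```

In .NET the same distinction would be between Match.Value (the whole match) and Match.Groups["imgSrc"].Value (the named group), assuming the original code reads the former. Note that this pattern matches the first src attribute on any tag, not only img, so it is only safe when the input is known to be simple.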

Edward Q. Bridges
  • Worked beautifully for me. My requirement was simply to extract a substring using a very specific pattern. The fact that the source string happens to be HTML is irrelevant. I'm not trying to parse HTML and I agree with the above commenters that whipping out a full HTML parser to do this simple task is overkill. – djskinner Mar 30 '11 at 15:54