0

I have found useful regular expressions on this site, but this particular one eludes me.

Basically, I need to extract this:

/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965

from this string using regex:

<p>test <a href=\"http://www.hotmail.com?id=1\" title=\"james\">james</a> <a href=\"http://mail.gmail.com/index.asp?id=1\" title=\"lafferty\">lafferty</a> <a href=\"https://mail.google.com/index.asp?id=1\" title=\"joseph\">joseph</a> <strong>swami</strong> is a <a href=\"http://mail.yahoo.com/tests?id=1\" title=\"great\">great</a> guy.<img src=\"/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965\" alt=\"nasa1\" title=\"nasa1\" style=\"width: 100px; height: 57px; \" width=\"100\" height=\"57\" /></p>\r\n<p><br /></p>\r\n<p><br /></p>

The regex I currently have extracts the URL without the query string. It is OK if the regex hard-codes the string '/uploadedimages/', but apart from that everything needs to be generic. The target could be anything, not just an image (for example, an href pointing to a PDF file), and the query string could be anything valid as well.

Other regexes I have found work only with absolute URLs starting with http, etc.

LarsTech
Vijay
  • Don't use regex. You have the power of .NET at your disposal, with tons of more robust ways to handle HTML. Use that instead. – FailedDev Dec 15 '11 at 21:10
  • Why not use regex instead of an HTML parser? - why go through the DOM for this? - I just need the URLs. – Vijay Dec 15 '11 at 21:11
  • What if the URL is malformed? What if a closing </a> tag is missing? There are millions of what-ifs. Then your code breaks, your client is unsatisfied and you are unemployed. – FailedDev Dec 15 '11 at 21:13
  • The URL cannot be malformed as it comes from an HTML generator which is not manually entered. If it were manually malformed, then right now it is ok to just ignore it (that is the requirement). If the tags are malformed - exactly why I would NOT want to use an HTML parser. – Vijay Dec 15 '11 at 21:16
  • Attempting to parse HTML with regular expressions [is not recommended](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Keith Thompson Dec 15 '11 at 21:29
  • Unless what you are trying to parse is a relatively small sized text/ html fragment with links inside it, and you still want to extract the URLs from it. – Vijay Dec 15 '11 at 21:51

3 Answers

1

I am not sure why nobody was able to provide an acceptable answer for this question. As this would be a very real problem for any developer who needs to extract URLs of any kind fully from an HTML fragment which may or may not be valid HTML, here is the answer which I have verified as working in C#:

matches = Regex.Matches(target, "(?<=\")(http:|https:)?[/\\\\](?:[A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2})*([/\\\\](?:([A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2}))*)*(?:\\?[a-zA-Z0-9=/\\\\&]+)?(?=\")", RegexOptions.IgnoreCase);

This extracts any number of URLs, including their query strings, from the HTML fragment. I have also modified the regex so that it works with C# string escaping: the pure regex will not work as-is in a regular C# string literal, because the \ and " characters have to be escaped.
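As a side note on the escaping point, a C# verbatim string literal (`@"..."`) leaves backslashes as they are and writes a double quote as `""`, which can make such patterns easier to read. The following is only a minimal sketch: the pattern is a deliberately simplified one that hard-codes the `/uploadedimages/` prefix (which the question allows) rather than the full pattern above, the names `UrlSketch` and `Extract` are hypothetical, and it assumes the quotes in the input are plain `"` characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlSketch
{
    // Simplified illustrative pattern, not the full pattern from the answer:
    // a value preceded by a double quote that starts with /uploadedimages/
    // and runs up to, but not including, the next double quote.
    // In a verbatim literal, "" stands for one " and backslashes are not doubled.
    const string Pattern = @"(?<="")/uploadedimages/[^""]*(?="")";

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
            urls.Add(m.Value);
        return urls;
    }
}
```

Because the value is taken as everything up to the closing quote, the query string comes along without any special handling.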

musefan
Vijay
0

I'd recommend doing this in stages, since it will be much simpler. You can use .NET in a cleaner way: regexes are not needed here, and neither is a full DOM parser, if you know the format the data will come in. Assuming for the moment that what you really want is the relative URL of the image source, and that there is only ever one image in the HTML, I would recommend something like the following.

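string Parse(string html)
{
    // Skip past src=" (5 characters); assumes the quote is not backslash-escaped
    // and that a src attribute is present.
    var temp = html.Substring(html.IndexOf("src=") + 5);
    // The value ends at the next double quote.
    return temp.Substring(0, temp.IndexOf("\""));
}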
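The substring approach above throws or returns garbage when there is no `src` attribute or no closing quote. A slightly more defensive variant might look like the sketch below. It is only an illustration under the same assumptions (one image, plain unescaped quotes); the names `SrcSketch` and `FirstSrc` are hypothetical, and returning null for "not found" is just one possible convention.

```csharp
using System;

static class SrcSketch
{
    // Returns the first src value, or null when no complete src="..." is found.
    public static string FirstSrc(string html)
    {
        const string marker = "src=\"";
        int start = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;            // no src attribute at all
        start += marker.Length;                // first character of the value
        int end = html.IndexOf('"', start);    // closing quote of the value
        return end < 0 ? null : html.Substring(start, end - start);
    }
}
```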

To do it using regular expressions, based off kgoedtel's answer (modified slightly) you'll need to do something like:

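// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;

// Returns the value of the first attribute of the first <img> tag.
string Parse(string html)
{
    var r = new Regex("<img [^=<>]+=\\\\?\"([^\\\\\"]+)");
    return r.Match(html).Groups[1].Value;
}

// Returns every quoted attribute value in the fragment, not only URLs.
IEnumerable<string> ParseMany(string html)
{
    var r = new Regex("[^=<>]+=\\\\?\"([^\\\\\"]+)");
    return r.Matches(html).OfType<Match>().Select(m => m.Groups[1].Value);
}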
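Note that the pattern in ParseMany matches any `name="value"` pair, so it also returns values of attributes such as title, alt and style. If only URLs are wanted, one option is to restrict the attribute name. The sketch below assumes that URLs appear only in `href` and `src` attributes, which is an assumption and not something stated in the question; `AttrSketch` and `ParseUrls` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AttrSketch
{
    // href or src, then =, an optional backslash, a double quote, and the value
    // captured up to the next backslash or double quote.
    static readonly Regex UrlAttr =
        new Regex(@"\b(?:href|src)=\\?""([^\\""]+)", RegexOptions.IgnoreCase);

    public static List<string> ParseUrls(string html)
    {
        return UrlAttr.Matches(html).Cast<Match>()
                      .Select(m => m.Groups[1].Value)
                      .ToList();
    }
}
```

Like the original, this tolerates both plain quotes and backslash-escaped quotes (`\"`) around the value. The `\b` is a loose guard only; a name such as `data-href` would still match.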
ForbesLindesay
  • "if you know the format the data will come in" - nope I don't know the format it will come in. The HTML fragment is created by an editor and the end user can input anything, including images, hrefs, can even manually enter a URL within it. – Vijay Dec 17 '11 at 17:44
  • @Vijay I mean if you know enough about the format it will come in, for example, are you looking for multiple results or just one? Do you know the result is a relative url? or could it be absolute? Do you know you want the URL of the source of an image tag or could it be any url on the page? – ForbesLindesay Dec 19 '11 at 15:40
0

Assuming you want a regex like this?

<([^=<>]+)=\\?"([^\\"]+)

Otherwise, please be less ambiguous about what you are actually trying to parse out. Thanks!
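To make clear what this pattern captures, here is a small sketch that prints both groups. Group 1 is the tag name together with the name of its first attribute, and group 2 is that attribute's value; because the pattern starts at `<`, only the first attribute of each tag is matched. `GroupSketch` and `Describe` are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupSketch
{
    // The pattern from the answer, written as a verbatim string literal.
    static readonly Regex FirstAttr = new Regex(@"<([^=<>]+)=\\?""([^\\""]+)");

    // Returns one "tag attribute -> value" line per match.
    public static string[] Describe(string html)
    {
        var matches = FirstAttr.Matches(html);
        var lines = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            lines[i] = matches[i].Groups[1].Value + " -> " + matches[i].Groups[2].Value;
        return lines;
    }
}
```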

kgoedtel