0

I have HTML from which I want to extract hyperlink values, and I would like to use regular expressions. The source of the whole page can be found at the link below:

http://dl.dropbox.com/u/4571235/example.html

I want to get the hyperlink that follows each 'compare prices' button in the document.

halfer
Laziale
  • Maybe read this first: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Elias Van Ootegem Apr 24 '12 at 18:37
  • The supplied link is now 404, and thus it means the question would be best marked as off-topic/on-hold. – halfer Jun 05 '22 at 19:23

3 Answers

1

Check here, and try this code:

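// Requires: using System.Text.RegularExpressions;
// Returns true when the whole string looks like an http, https, or ftp URL.
// Note: this validates a single URL; it does not extract links from HTML.
public static bool isValidUrl(string url) // 'ref' dropped: the argument is never reassigned
{
    string pattern = @"^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$";
    Regex reg = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    return reg.IsMatch(url);
}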
Mitja Bonca
  • I want to get only those links for compare prices button. Not all the links on the form. Is that possible? Thanks – Laziale Apr 24 '12 at 18:40
0

I see that there are other URLs in the source code as well. I can suggest the following regex, but it will work correctly ONLY IF each 'compare prices' text is followed directly by the URL you are interested in (i.e. there is no other URL between the text and the correct one). If there is a 'compare prices' text without a matching URL, the regex will need to be changed according to whatever rules apply.

value="Compare prices"(?:.*?)<a\s+href="([^"]*?)"

The URL will be in capture group 1.
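To show how the pattern above might be applied from C#, here is a minimal sketch. The class and method names (ComparePricesLinks, Extract) are hypothetical, and because the original page is no longer available, the markup it expects is an assumption. RegexOptions.Singleline is added because `.` does not match newlines by default, and the button and its link will probably sit on different lines; RegexOptions.IgnoreCase is likewise an assumption about how the button text is cased.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ComparePricesLinks
{
    // Pattern from the answer above; group 1 captures the href value.
    // Singleline lets ".*?" span line breaks between the button and the link.
    private static readonly Regex LinkAfterButton = new Regex(
        @"value=""Compare prices""(?:.*?)<a\s+href=""([^""]*?)""",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the first href found after each 'Compare prices' button.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkAfterButton.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

The same caveat applies as for the regex itself: if a button has no link of its own, the lazy match will run on to the next link in the page and return that instead.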

Joanna Derks
0

Usually a link is in an "a" tag, a "link" tag, or an "img" tag (img src="url").
If it is in an a href tag, you could start by checking for valid a href tags and then validating just those.
0. First, get all the inner HTML of the form that contains your buttons.
1. Then grab just the tags for further inspection: pattern="<a[^>]*>" or pattern="<link[^>]*>" or pattern="<img[^>]*>"
2. Then, for each of the tags, pull out the link (the src or href attribute).
3. Then check whether the URL is valid.
Note: if you can do step 0, you can most likely just get all the attributes of a given type and then run a regular expression on them as well.
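Steps 1 to 3 could be sketched roughly as follows. This assumes step 0 has already been done and the form's inner HTML is passed in as a string. The names TagLinkScanner and ExtractValidUrls are hypothetical. Only double-quoted attribute values are handled, and "valid" is taken here to mean an absolute http or https URL, which is an assumption rather than something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagLinkScanner
{
    // Step 1: match whole <a>, <link>, or <img> opening tags.
    private static readonly Regex TagPattern = new Regex(
        @"<(?:a|link|img)\b[^>]*>", RegexOptions.IgnoreCase);

    // Step 2: capture a double-quoted href or src attribute value inside one tag.
    private static readonly Regex AttrPattern = new Regex(
        @"\b(?:href|src)\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);

    public static List<string> ExtractValidUrls(string formHtml)
    {
        var result = new List<string>();
        foreach (Match tag in TagPattern.Matches(formHtml))
        {
            foreach (Match attr in AttrPattern.Matches(tag.Value))
            {
                string candidate = attr.Groups[1].Value;
                Uri uri;
                // Step 3: keep only absolute http/https URLs (assumed meaning of "valid").
                if (Uri.TryCreate(candidate, UriKind.Absolute, out uri)
                    && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
                {
                    result.Add(candidate);
                }
            }
        }
        return result;
    }
}
```

Uri.TryCreate is used for the validity check instead of a hand-written URL regex. This approach returns every link in the form, so picking out only the 'compare prices' links would still need an extra filter.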

RetroCoder