I started learning regex two days ago. Now I'd like to make a small application that reads the source code of a web page and extracts URLs such as http://page.com or http://www.page.com/sub/sub/sub?=value. This is the code I typed:

Regex r = new Regex("http://\\w");

HttpWebRequest httpwebrequest = (HttpWebRequest)WebRequest.Create("http://maktoob.yahoo.com/?p=us");
HttpWebResponse response = (HttpWebResponse)httpwebrequest.GetResponse();

StreamReader sr = new StreamReader(response.GetResponseStream());

string line;

while ((line = sr.ReadLine()) != null)
{
    Match m = r.Match(line);
    if (m.Success)
    {
        Console.WriteLine("Match: " + m.Value);
    }
}
sr.Close();
response.Close();

But the result is:

Match: http://l
Match: http://w
Match: http://x
Match: http://l
Match: http://q

It only gets the first character after the //. Looking at my pattern, http://\w, I can see why: \w matches a single character. What should I add to the pattern so that it matches the rest of the link?

R.Vector
  • possible duplicate of [What is the best regular expression to check if a string is a valid URL?](http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – L.B Feb 14 '12 at 21:29

4 Answers

If you only need to match hyperlinks within <a> elements, then you could take advantage of the enclosing single or double quotes to delimit your URL.

Regex regex = new Regex(@"(?<=href=('|""))https?://.*?(?=\1)");

That would match any text within an href='…' or href="…" attribute that starts with http:// or https://.
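As a minimal sketch of how that pattern could be applied, the following collects every match from one string with Regex.Matches. The class name HrefLinks, the helper Extract, and the sample markup are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefLinks
{
    // Pattern from the answer: a lookbehind for href=' or href=" (the quote is
    // captured as group 1), a lazy URL body, and a lookahead for the same quote.
    static readonly Regex HrefUrl = new Regex(@"(?<=href=('|""))https?://.*?(?=\1)");

    // Hypothetical helper: returns every href URL found in the given markup.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefUrl.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    static void Main()
    {
        // Made-up sample markup, one link per quote style.
        string html = "<a href=\"http://page.com\">one</a> "
                    + "<a href='https://www.page.com/sub/sub?x=value'>two</a>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine("Match: " + url);
        }
    }
}
```

Because the quotes sit in a lookbehind and a lookahead, they are checked but not included in m.Value.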

Douglas

This regex should do what you want. Note that it matches HTTPS in addition to HTTP:

https?://\\w*
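
To see what this pattern captures, here is a small sketch with made-up sample URLs (the class WordCharDemo and the helper FirstMatch are hypothetical names). Since \w covers only letters, digits, and the underscore, each match ends at the first character outside that set, such as a dot.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordCharDemo
{
    // Hypothetical helper: returns the first match of the pattern above,
    // or an empty string when nothing matches.
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, "https?://\\w*").Value;
    }

    static void Main()
    {
        // The optional "s" lets both schemes match.
        Console.WriteLine(FirstMatch("http://www.page.com/sub"));
        Console.WriteLine(FirstMatch("https://page.com"));
    }
}
```

For the two sample inputs, the matches stop right before the first dot, so a wider character class is needed to capture a whole link.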
Nick Garvey

Try https?://[^'\"]+

This should work, since links are usually enclosed in quotes.

[edit] Or, even better, match only valid characters. [abc]+ matches a run of one or more characters, each of which is one of those listed between the brackets. See this answer for a list of valid characters.
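
A minimal sketch of the quote-delimited pattern applied to one line of markup follows. The class QuotedLinks, the helper Extract, and the sample line are assumptions for illustration. It uses Regex.Matches rather than Regex.Match, because Match returns only the first hit in a line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedLinks
{
    // http:// or https://, then one or more characters that are
    // neither a single quote nor a double quote.
    static readonly Regex Url = new Regex("https?://[^'\"]+");

    // Hypothetical helper: returns every URL found in one line of source.
    public static List<string> Extract(string line)
    {
        var urls = new List<string>();
        foreach (Match m in Url.Matches(line))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    static void Main()
    {
        // Made-up sample line containing two quoted links.
        string line = "<a href=\"http://page.com\">home</a> "
                    + "<img src='http://www.page.com/sub/sub/sub?=value'>";

        foreach (string url in Extract(line))
        {
            Console.WriteLine("Match: " + url);
        }
    }
}
```

A URL that appears in plain text without surrounding quotes would keep matching up to the next quote character, which is one reason to prefer a class of valid URL characters.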

Rado
  • Oh, that is so much better, but I don't understand that pattern. Could you explain it a little? – R.Vector Feb 14 '12 at 21:31
  • [^abc] says match anything except the letters a, b, or c. The ^ means "not"; without it, you match only the characters within the brackets. I used quotes because, in HTML, links are generally enclosed in quotes, so the pattern will match from http up to the next quote it finds. – Rado Feb 14 '12 at 21:37

How accurate/robust do you want to be? One of the best regular expressions I've found so far matches just about all of the URLs one could possibly throw at it:

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

You can see the full comparison table of URL validation regular expressions here: http://mathiasbynens.be/demo/url-regex

Update

As Douglas mentioned, if you want to match links, rather than any text that passes as a URL scheme, then you can look at the anchor tags. However, people can put anything in an anchor tag, for example <a href="http://junk,.sdf8(_.jf/.klkjl">Junk Link</a>, so you will still need to validate that the URL is well formed.
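
One way to do that extra check in .NET, as a sketch only: this swaps the regex above for the framework's own parser, Uri.TryCreate, which is a different and more lenient technique than the pattern shown. The class LinkCheck, the helper IsHttpUrl, and the sample candidates are hypothetical.

```csharp
using System;

static class LinkCheck
{
    // Hypothetical helper: true when the candidate parses as an absolute URI
    // whose scheme is http or https. This is not the regex from the answer.
    public static bool IsHttpUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    static void Main()
    {
        // Made-up candidates, e.g. values pulled out of href attributes.
        string[] candidates =
        {
            "http://page.com",
            "https://www.page.com/sub/sub?x=value",
            "ftp://page.com/file",
            "http://junk,.sdf8(_.jf/.klkjl"
        };

        foreach (string candidate in candidates)
        {
            Console.WriteLine(candidate + " -> " + IsHttpUrl(candidate));
        }
    }
}
```

How the parser treats the junk link from the example is worth checking on your own runtime; the strict regex and Uri.TryCreate do not necessarily agree on borderline input.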

Kiril