
OK, so I am trying to extract all the links from the Google homepage using regular expressions.

But I am facing a baffling problem. When I send a request to the Google homepage and try to extract all the links from the response, I usually get one valid result and the rest is garbage. However, when I view the page source manually and test an individual link against the pattern, it matches.

I don't know what is wrong here. Either my pattern is flawed (I am trying hard to get it right), or Google is sending different responses to my code and to my browser. I would really appreciate some insight into this problem.

My pattern

string pattern = @"=("")?(https?:\/\/)?[\w.-]+\.[\w]*([/]?[\w]*)*("")?";

My display code

Match match = Regex.Match(source, pattern);
if (match.Success)
{
    foreach (var res in match.Groups)
    {
        Console.WriteLine(res);
    }
    Console.ReadKey();
}
Win Coder
  • We need the contents of `source`... – It'sNotALie. Aug 06 '13 at 21:54
  • Put a breakpoint in your code and extract whatever data is attached to `source` and you'll see what it is Google is showing your program. – Logarr Aug 06 '13 at 21:55
  • Well, that's going to be the whole source of the Google homepage; not sure if that's appropriate to post. – Win Coder Aug 06 '13 at 21:55
  • Post it in a formatted code section and it will not take up the whole page. – Logarr Aug 06 '13 at 21:57
  • Why don't you use an HTML parser like [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) instead of regex? – I4V Aug 06 '13 at 21:58
  • Surely it's easy to test whether your code is getting different HTML than your browser... – Chris Aug 06 '13 at 22:01
  • Your regex is very permissive. The only required elements of a url are: at least one of `[\w.-]` followed by a dot. The rest is all optional. So "`..`" is a match. Trying it against google source code, pulled from google.com, I see a valid url as the first match, then a match on "`more.`", and match on "`for.`" a match on "`window.google`", etc. – femtoRgon Aug 06 '13 at 22:02
  • @femtoRgon Thanks for pointing out the mistakes. I am really new to regular expressions and will try to improve my regex. – Win Coder Aug 06 '13 at 23:12

3 Answers


Don't try to parse HTML with regex. Use an HTML parser such as the Html Agility Pack instead. This gets all href links from the given webpage (from their example page):

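HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(myURL);
// Newer Html Agility Pack releases expose the root as DocumentNode
// (older examples used DocumentElement).
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Print the value of each anchor's href attribute.
    Console.WriteLine(link.GetAttributeValue("href", ""));
}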
keyboardP
  • Fair enough but just something to be wary of in case you think about using them for HTML/XML parsing :) – keyboardP Aug 06 '13 at 23:20

I think the problem is that you're only getting one match. You need to call Matches, or run a loop:

// Regex.Matches returns a MatchCollection containing every match.
MatchCollection ms = Regex.Matches(source, pattern);
foreach (Match m in ms)
{
    Console.WriteLine(m.Value);
}

or ...

Match m = Regex.Match(source, pattern);
while (m.Success)
{
    Console.WriteLine(m.Value);
    m = m.NextMatch();
}
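
To make the difference concrete, here is a minimal sketch that contrasts iterating over `match.Groups` (as in the question) with iterating over `Regex.Matches`. The class name, input string, and pattern are made up for illustration and are not from the question:

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsVersusMatches
{
    static void Main()
    {
        // Hypothetical input and pattern, chosen only for illustration.
        string source = "a=1 b=2 c=3";
        string pattern = @"(\w)=(\d)";

        // Regex.Match stops at the first match. Its Groups collection holds
        // that one match's capture groups (group 0 is the whole match),
        // not the later matches in the input.
        Match first = Regex.Match(source, pattern);
        foreach (Group g in first.Groups)
        {
            Console.WriteLine("group: " + g.Value);
        }

        // Regex.Matches returns every non-overlapping match in the input.
        foreach (Match m in Regex.Matches(source, pattern))
        {
            Console.WriteLine("match: " + m.Value);
        }
    }
}
```

The first loop prints `a=1`, `a`, and `1`, all pieces of the same match. The second prints `a=1`, `b=2`, and `c=3`. With a pattern full of optional groups, the first loop would also print empty strings for groups that did not participate, which may be the "garbage" described in the question.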

Note that you shouldn't in general try to parse HTML with regular expressions. There lies madness. But if you don't care that some of the "links" you pick up aren't really links (i.e. they might be text rather than hrefs), then using a regular expression this way isn't a problem.

By the way, there is an MSDN article, Example: Scanning for HREFs, that you might find useful.
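
If you do stay with regular expressions for learning purposes, anchoring the pattern on the `href` attribute cuts down on false positives. The following is a rough sketch, not the pattern from the MSDN article. The sample `source` string and the group name `url` are assumptions for illustration, and the pattern only handles quoted attribute values:

```csharp
using System;
using System.Text.RegularExpressions;

class HrefScan
{
    static void Main()
    {
        // Hypothetical sample input standing in for the downloaded page source.
        string source = "<a href=\"http://www.google.com/imghp\">Images</a> " +
                        "<a href='/intl/en/about.html'>About</a> window.google=1";

        // Require the literal attribute name, then capture everything between
        // the quotes. Unquoted attribute values are not handled by this sketch.
        string pattern = @"href\s*=\s*[""'](?<url>[^""']+)[""']";

        foreach (Match m in Regex.Matches(source, pattern, RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Groups["url"].Value);
        }
    }
}
```

On the sample input this prints the two href values and skips `window.google=1`, because the pattern requires the `href` attribute name before the `=`.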

Jim Mischel
  • Yes, I know regular expressions are not good for parsing HTML, but I am using them for the purpose of learning. – Win Coder Aug 06 '13 at 23:11

OK, I think I figured out the problem. Regex.Match only returns the first match; replace it with Regex.Matches to get all of the links.

Win Coder