C# URL Crawler not getting enough links?

Question

I have the following code, however, when I launch it I only ever seam to get a few URLS returned.

while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader
           (response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();
            //add links to the list
            // process the content
            //clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);

            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);

                updateResults(theResults, "\"" + cleanMatch + "\",\n");

            }
        }
    }
}

I think the error is within the regex.

What I am trying to achieve is pull a webpage, then grab all the links from that page - add these to a list, then for each list item fetch the next page and repeat the process.

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

3

Instead of trying to use regex to parse HTML, I suggest using a good HTML parser - the HTML Agilty Pack is a great choice:

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

edited Jun 20 '20 at 09:12

Community

1
1

answered Jul 09 '12 at 20:35

Oded

489,969
99
883
1,009

What other HTML parsers are available? The HTML Agility Pack is missing any documentation. – developer__c Jul 09 '12 at 20:41
1

@thatnerdoverthere - There are **lots** of examples in the source download. – Oded Jul 09 '12 at 20:42

C# URL Crawler not getting enough links?

1 Answers1