I have created a program that downloads the links from a web page into an .htm file. I want to test each link in that file and output any that are broken. Unfortunately, not all of the downloaded links start with "http://", so I tried to work around this with an if statement. How can I read all the links into an array and then loop through that array with async web requests and responses?

    private async void button4_Click(object sender, EventArgs e)
    {
        string text = System.IO.File.ReadAllText(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\OP.htm");

        List<string> stringlist = new List<string>();
        stringlist.Add(text);

        if (!text.StartsWith("http://"))
        {
            foreach (string line in stringlist)
            {
                var request = WebRequest.Create(text);
                var response = (HttpWebResponse)await Task.Factory
                    .FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);

                Debug.Assert(response.StatusCode == HttpStatusCode.OK);

                if (response == null)
                {
                    BrokenLinks.Text = text;
                }
                else
                {
                    BrokenLinks.Text = "All URLS Are OK";
                }
            }
        }
    }

Regex to parse the html file:

    string text = System.IO.File.ReadAllText(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\OP.htm");

    string regex = "href=\"(.*)\"";
    Match match = Regex.Match(text, regex);
    if (match.Success)
    {
        string link = match.Groups[1].Value;
        Console.WriteLine(link);

        MessageBox.Show("Going over URLS now Please stand by.");
        var request = WebRequest.Create(link);
        var response = (HttpWebResponse)await Task.Factory
            .FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);

        Debug.Assert(response.StatusCode == HttpStatusCode.OK);

        if (response == null)
        {
            BrokenLinks.Text = text;
            label2.ForeColor = System.Drawing.Color.Red;
        }
        else
        {
            BrokenLinks.Text = "All URLS Are OK";
            label2.ForeColor = System.Drawing.Color.Green;
        }
    }
Conall Curran
  • I think it isn't clear what you're trying to achieve. Do you want to add http:// where needed, test the links with a request, or both? – Daniele Sassoli Jan 07 '16 at 14:37
  • @DenisBokor I already have a txt file containing HTML links. With the WebRequest tool, a link must start with "http://" for the request to run, but unfortunately not all of the links in the text file start with "http://". So I need to send async web requests only to the links in the file that start with "http://"; everything else should be ignored. – Conall Curran Jan 07 '16 at 14:45
  • Maybe I'm missing something obvious, but reading your code: first you read an HTML file and, without any parsing, add the whole string to a list. Then why do you check whether the string doesn't start with http://? It's an HTML file, so it will never start with http://. Then you run a foreach loop over a list that contains only one element. Why? It looks to me like you need to revisit the logic of your code. Could you please add the content of the file you're reading? – Daniele Sassoli Jan 07 '16 at 14:53
  • @ Denis Bokor The obvious thing you're missing is that I'm not too sure how to do any of that. Do you parse the htm file first and then put it into a List? Basically, I need all of the http:// links within that file so I can test them. I'll amend the code. – Conall Curran Jan 07 '16 at 14:58

1 Answer

I think this piece of code should put you on the right track. Obviously, it will only work if the file you're reading is a text file with one link per line.

    var lines = File.ReadLines(fileName); // reads the file lazily, one line at a time
    foreach (var line in lines)
    {
        if (line.StartsWith("http://"))
        {
            // Execute your request, since this looks like a valid link.
        }
        else
        {
            // The URL doesn't start with http://. If you want to check it anyway,
            // prepend "http://" to the string; otherwise, do nothing.
        }
    }
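
If the file is raw HTML rather than one link per line, the links have to be pulled out first. Note that Regex.Match returns only the first match, and a greedy (.*) can run past the closing quote. Here is a minimal sketch of one way to collect every href value that starts with http:// or https://. The LinkExtractor class and ExtractHttpLinks method are hypothetical names I made up for illustration, and a regex is only an approximation of real HTML parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class LinkExtractor
{
    // Captures everything between href=" and the next double quote.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns each distinct href value that starts with http:// or https://.
    // Anchors such as "#top" and relative paths are skipped.
    public static List<string> ExtractHttpLinks(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Where(link =>
                link.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                link.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            .Distinct()
            .ToList();
    }
}
```

You would call it as LinkExtractor.ExtractHttpLinks(File.ReadAllText(fileName)) and then loop over the returned list instead of over the raw lines.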

If you want to check whether a link is valid, please refer to this answer. I hope this helps.
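
For the looping part of the question, here is a rough sketch of how the async checks could be run over a whole list. It uses HttpClient instead of the WebRequest/FromAsync pattern from the question (that swap is my choice, not something the original code does), and it treats any non-2xx status or any failed request as "broken", which is an assumption you may want to adjust. LinkChecker, IsReachableAsync and FindBrokenLinksAsync are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original code.
public static class LinkChecker
{
    // Returns true when the URL answers with a 2xx status code.
    // Failed requests, timeouts and invalid URLs count as not reachable.
    public static async Task<bool> IsReachableAsync(HttpClient client, string url)
    {
        try
        {
            // Only the headers are needed to read the status code.
            using (var response = await client.GetAsync(
                url, HttpCompletionOption.ResponseHeadersRead))
            {
                return response.IsSuccessStatusCode;
            }
        }
        catch (Exception ex) when (
            ex is HttpRequestException ||
            ex is TaskCanceledException ||
            ex is InvalidOperationException ||
            ex is UriFormatException)
        {
            return false;
        }
    }

    // Starts all checks, awaits them together, and returns the URLs that failed.
    public static async Task<List<string>> FindBrokenLinksAsync(
        HttpClient client, IEnumerable<string> urls)
    {
        var checks = urls
            .Select(async url => new { Url = url, Ok = await IsReachableAsync(client, url) })
            .ToList();

        var results = await Task.WhenAll(checks);
        return results.Where(r => !r.Ok).Select(r => r.Url).ToList();
    }
}
```

In the button handler you could then await LinkChecker.FindBrokenLinksAsync(client, links) and set BrokenLinks.Text to the joined result, or to "All URLS Are OK" when the list is empty. Reusing one HttpClient for all requests is generally preferable to creating one per link.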

Daniele Sassoli
  • @ Denis Bonkor Thanks, I'll try this code. The file itself contains a list of links, some beginning with "http://" and others beginning with hashtags. – Conall Curran Jan 07 '16 at 15:19