
I have a webpage which has nothing on it except some string(s). No images, no background color or anything, just some plain text which is not really that long.

I am just wondering, what is the best (by that, I mean fastest and most efficient) way to pass the string in the webpage so that I can use it for something else (e.g. display it in a text box)? I know of WebClient, but I'm not sure it'll do what I want it to do, and I don't want to try it even if it would work, because the last time I did, a simple operation took approximately 30 seconds.

Any ideas would be appreciated.

  • The WebClient class is the natural choice here. WebClient shouldn't take 30 seconds to run (assuming no other network problems). – Jimmy Jan 21 '11 at 11:32
  • Your choices are limited to WebClient or WebRequest/WebResponse (which is what WebClient uses under the hood, so just go for WebClient). As to why it is slow, that has nothing to do with the implementation of the .NET HTTP stack. It could be network problems, a poor implementation of the web site you are trying to fetch which makes it slow to return a response, ... For example, running a web client against a correctly written web site such as http://www.google.com takes a few milliseconds to fetch the response, far less than the 30s you are observing with your site. – Darin Dimitrov Jan 21 '11 at 11:34
  • By "pass" do you mean parse? If so, what technology are you parsing it with? i.e. what kind of text box: WinForms, another website? – Rob Jan 21 '11 at 11:37

6 Answers


The WebClient class should be more than capable of handling the functionality you describe, for example:

System.Net.WebClient wc = new System.Net.WebClient();
byte[] raw = wc.DownloadData("http://www.yoursite.com/resource/file.htm");

string webData = System.Text.Encoding.UTF8.GetString(raw);

or (further to the suggestion from Fredrik in the comments):

System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://www.yoursite.com/resource/file.htm");

When you say it took 30 seconds, can you expand on that a little more? There are many reasons why that could have happened: slow servers, internet connections, a dodgy implementation, etc.
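If you want to rule out your own code, a quick way to measure just the download is to wrap it in a Stopwatch (a minimal sketch; the URL is a placeholder):

var sw = System.Diagnostics.Stopwatch.StartNew();
string webData;
using (var wc = new System.Net.WebClient())
{
    webData = wc.DownloadString("http://www.yoursite.com/resource/file.htm");
}
sw.Stop();
// Several seconds for a small page points at the network or the server,
// not at WebClient itself
Console.WriteLine("Downloaded {0} characters in {1} ms", webData.Length, sw.ElapsedMilliseconds);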

You could go a level lower and implement something like this:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("http://www.yoursite.com/resource/file.htm");

// For a plain GET of the page no request body is needed; you would only
// write to the request stream if you were POSTing data to the server.

string responseData;
using (HttpWebResponse httpResponse = (HttpWebResponse)webRequest.GetResponse())
using (StreamReader responseReader = new StreamReader(httpResponse.GetResponseStream()))
{
    responseData = responseReader.ReadToEnd();
}

However, at the end of the day the WebClient class wraps up this functionality for you. So I would suggest that you use WebClient and investigate the causes of the 30 second delay.
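One thing to be aware of while investigating: WebClient exposes no Timeout property of its own, so a slow server simply makes the call block. A common workaround (a sketch of a well-known pattern, not something from this answer; the class name TimeoutWebClient is made up) is to subclass WebClient and set the timeout on the underlying request:

public class TimeoutWebClient : System.Net.WebClient
{
    // Timeout in milliseconds applied to every request this client creates
    public int TimeoutMilliseconds { get; set; }

    protected override System.Net.WebRequest GetWebRequest(System.Uri address)
    {
        System.Net.WebRequest request = base.GetWebRequest(address);
        request.Timeout = TimeoutMilliseconds;
        return request;
    }
}

With a short timeout, a hung server fails fast with a WebException instead of stalling for half a minute.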

MrEyes
    Alternatively, use the [DownloadString](http://msdn.microsoft.com/sv-se/library/fhd1f0sw.aspx) method and get rid of the byte array handling: `string result = wc.DownloadString(...` – Fredrik Mörk Jan 21 '11 at 11:40
  • I coded a button that would save a page (though one that has quite a bit of traffic) using the WebClient class and then replace some contents in a file with some of the contents of the page. Using a stopwatch I timed how long it took, and it varied from 10s-40s. The internet connection may have been bad, but I doubt that was the main reason. Unfortunately I no longer have the code for that button, otherwise I would have posted it. :\ – Iceyoshi Jan 21 '11 at 11:41
  • @Fredrik : +1 for the DownloadString suggestion – MrEyes Jan 21 '11 at 11:42
  • btw, does the page download more quickly when viewed from a browser? Also, is the web page secure? In that case validating certificates can take about 40 seconds if your computer cannot contact the root certificate. In my experience DNS misconfiguration can often lead to slow network response. – Jimmy Jan 21 '11 at 11:44
  • @Iceyoshi : How much data were you downloading? A couple of KB or a couple of MB? Also are you sure the delay was on the WebClient call and not on the subsequent parsing/replacing? – MrEyes Jan 21 '11 at 11:46
  • @Iceyoshi, have you considered the possibility that the web site you are trying to access is throttling your request? There are sites which would do this if multiple requests are sent from the same IP to avoid plagiarizing their intellectual property. – Darin Dimitrov Jan 21 '11 at 11:55
  • I think it was approximately 700KB. Also, I believe the problem was with the WebClient call, because when I commented out the code related to it, the operation wouldn't take more than a few seconds. Darin Dimitrov: You do have a valid point, although I wouldn't say I was plagiarizing their property. Anyway, I am marking this answer correct because it seems to be the best way (thanks for telling me about DownloadString). A small problem I am facing with it is that the first time the WebClient sends the request it takes about 5 seconds, and after that it does it faster (in 1 second or so). – Iceyoshi Jan 23 '11 at 08:18
  • The above code gets the HTML output as a string. But for my purposes I need to get the price of a trading value on this site: https://www.mcxindia.com/en/market-data/get-quote/ZINCMINI/30NOV2017 The values change frequently. The given code does not get this value; it shows empty tags... – Bala Nov 23 '17 at 10:25

If you're downloading text, then I'd recommend using WebClient and getting a StreamReader over the text:

WebClient web = new WebClient();
using (System.IO.Stream stream = web.OpenRead("http://www.yoursite.com/resource.txt"))
using (System.IO.StreamReader reader = new System.IO.StreamReader(stream))
{
    string text = reader.ReadToEnd();
}

If this is taking a long time, then it is probably a network issue or a problem on the web server. Try opening the resource in a browser and see how long that takes. If the webpage is very large, you may want to look at streaming it in chunks rather than reading all the way to the end as in that example. Look at http://msdn.microsoft.com/en-us/library/system.io.stream.read.aspx to see how to read from a stream; a sketch of the chunked approach follows.
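For example, a chunked read might look like this (a minimal sketch; the 4096-character buffer size is an arbitrary choice, and the URL is the same placeholder as above):

WebClient web = new WebClient();
System.Text.StringBuilder text = new System.Text.StringBuilder();

using (System.IO.Stream stream = web.OpenRead("http://www.yoursite.com/resource.txt"))
using (System.IO.StreamReader reader = new System.IO.StreamReader(stream))
{
    char[] buffer = new char[4096];
    int read;
    // Read returns 0 once the end of the stream has been reached
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Each chunk could be processed here instead of waiting for the full download
        text.Append(buffer, 0, read);
    }
}

This keeps per-read memory bounded and lets you start processing before the download has finished.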

Phill

Regarding the suggestion "So I would suggest that you use WebClient and investigate the causes of the 30 second delay":

From the answers to the question System.Net.WebClient unreasonably slow:

Try setting Proxy = null:

WebClient wc = new WebClient();
wc.Proxy = null;  // skips automatic proxy detection, which can make the first request very slow

Credit to Alex Burtsev

Tester
This downloads the page and pulls all of the URLs out of the content with a regex:

WebClient client = new WebClient();
using (Stream data = client.OpenRead(Text))  // Text is assumed to hold the target URL
{
    using (StreamReader reader = new StreamReader(data))
    {
        string content = reader.ReadToEnd();
        // Matches absolute URLs for common schemes (http, https, ftp, ...)
        string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
        MatchCollection matches = Regex.Matches(content, pattern);
        List<string> urls = new List<string>();
        foreach (Match match in matches)
        {
            urls.Add(match.Value);
        }
    }
}
Another option is to load the page as XML, though this only works if the page is well-formed XHTML:

XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;

If you use the WebClient to read the contents of the page, it will include HTML tags.

string webURL = "https://yoursite.com";
WebClient wc = new WebClient();
wc.Headers.Add("user-agent", "Only a Header!");
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);

After getting the content, the HTML tags should be removed. Regex can be used for this:

var result = Regex.Replace(webContent, "<.*?>", String.Empty);

But this method is not very accurate; a better way is to install HtmlAgilityPack and use the following code:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webContent);
string result = doc.DocumentNode.InnerText;

You say it takes 30 seconds, but that has nothing to do with using WebClient (the main factors are the internet connection or a proxy). WebClient has worked very well for me.

Hossein Sabziani