I am creating a C# 4.0 application that downloads webpage content using WebClient.

WebClient function

    public static string GetDocText(string url)
    {
        string html = string.Empty;
        try
        {
            using (ConfigurableWebClient client = new ConfigurableWebClient())
            {
                /* Set timeout for webclient */
                client.Timeout = 600000;

                /* Build url */
                Uri innUri = null;
                if (!url.StartsWith("http://"))
                    url = "http://" + url;

                Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out innUri);

                try
                {
                    client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR " + "3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.2; AskTbFXTV5/5.15.4.23821; BRI/2)");
                    client.Headers.Add("Vary", "Accept-Encoding");
                    client.Encoding = Encoding.UTF8;
                    html = client.DownloadString(innUri);
                    if (html.Contains("Pagina non disponibile"))
                    {
                        string str = "site blocked";
                        str = "";
                    }

                    if (string.IsNullOrEmpty(html))
                    {
                        return string.Empty;
                    }
                    else
                    {
                        return html;
                    }

                }
                catch (Exception ex)
                {
                    return "";
                }
                finally
                {
                    client.Dispose();
                }
            }
        }
        catch (Exception ex)
        {
            return "";
        }
    }

    public class ConfigurableWebClient : WebClient
    {
        public int? Timeout { get; set; }

        public int? ConnectionLimit { get; set; }

        protected override WebRequest GetWebRequest(Uri address)
        {

            var baseRequest = base.GetWebRequest(address);

            var webRequest = baseRequest as HttpWebRequest;

            if (webRequest == null)

                return baseRequest;

            if (Timeout.HasValue)

                webRequest.Timeout = Timeout.Value;

            if (ConnectionLimit.HasValue)

                webRequest.ServicePoint.ConnectionLimit = ConnectionLimit.Value;

            return webRequest;

        }
    }

When I examine the content downloaded by WebClient, it is slightly different from what the browser shows. I give the same URL to the browser (Mozilla Firefox) and to my WebClient function. The browser shows the page content correctly, but WebClient's DownloadString returns different HTML. Please see the WebClient response below.

HTML downloaded by WebClient

<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/pgol/4-abbigliamento/3-Roma%20%28RM%29/p-7&distil_RID=A8D2F8B6-B314-11E3-A5E9-E04C5DBA1712" />
<script type="text/javascript" src="/ga.280243267228712.js?PID=6D4E4D1D-7094-375D-A439-0568A6A70836" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#glance7ca96c1b,#hiredf795fe70,#target01a7c05a,#hiredf795fe70{display:none!important}</style></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
<div id="d__fFH"><OBJECT id="d_dlg" CLASSID="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></OBJECT><span id="d__fF"></span></div></body>
</html>

My problem is that my WebClient function does not return the actual webpage content.

halfer
Ragesh P Raju
  • `WebClient` and `WebBrowser` use different User Agent strings and run on completely different sessions. The page may render differently for different user agents. Use either `WebClient` or `WebBrowser`. If you decide to proceed with `WebBrowser`, check [this](http://stackoverflow.com/a/22262976/1768303). – noseratio Mar 24 '14 at 07:40
  • Thank you for your valuable reply. Is there any mistake in my User-Agent string? I don't want to use the WebBrowser control. – Ragesh P Raju Mar 24 '14 at 07:44
  • I don't think it's *only* the UA string. There are probably some other HTTP headers that differ between `WebBrowser` and `WebClient`. Try spying on both with Fiddler. Moreover, `WebClient` doesn't support any client-side scripts. That may also affect how the page renders. – noseratio Mar 24 '14 at 07:48
  • Clean up browser's cookies and try again. – Oleg Mar 24 '14 at 09:25
  • Hi Oleg, thank you for your valuable reply. I am not using the WebBrowser control in my application; I only use WebClient. – Ragesh P Raju Mar 24 '14 at 09:49

2 Answers

Some web applications respond differently depending on the HTTP request headers.

So, if you want the same HTML that the web browser gets, send the same HTTP request that your browser sends.

How? Open the Firefox or Chrome developer tools and copy the HTTP request.
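As a rough illustration of copying the browser's request, the sketch below sets a few request headers on a WebClient before downloading. The class name `BrowserLikeDownloader` and the header values are assumptions made up for this example; replace the values with the ones your own browser shows in its developer tools. Matching headers may still not be enough if the site relies on cookies or client-side scripts.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper (not from the original post): a WebClient that sends
// browser-like request headers. The header values are placeholders.
public static class BrowserLikeDownloader
{
    public static WebClient CreateClient()
    {
        var client = new WebClient();

        // Copy these values from the browser's Network tab.
        client.Headers[HttpRequestHeader.UserAgent] =
            "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0";
        client.Headers[HttpRequestHeader.Accept] =
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        client.Headers[HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.5";

        // Accept-Encoding is deliberately not set here: a plain WebClient does
        // not decompress gzip responses on its own.
        client.Encoding = Encoding.UTF8;
        return client;
    }

    public static string Download(string url)
    {
        // A fresh client per call, so the headers are always present on the request.
        using (WebClient client = CreateClient())
        {
            return client.DownloadString(new Uri(url));
        }
    }
}
```

Calling `BrowserLikeDownloader.Download("http://example.com/")` would then issue the request with those headers; comparing the two requests in Fiddler shows which headers still differ.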


han058

In my case WebClient's DownloadData/DownloadFile/DownloadString methods showed different results than when downloading the file from a browser, like Chrome. First I thought it was an encoding problem and looped through all the encodings from Encoding.GetEncodings(), but the output data showed nonsense characters. Then after much searching I ended up here.

I looked at the Response headers in the Chrome browser Network tab as @han058 suggested and it read:

Cache-Control: public, max-age=900
content-disposition: attachment;filename=FILENAME.csv
Content-Encoding: gzip
Content-Length: 29310
Content-Type: text/plain; charset=utf-8
Date: Sat, 04 Jan 2020 20:20:13 GMT
Expires: Sat, 04 Jan 2020 20:35:14 GMT
Last-Modified: Sat, 04 Jan 2020 20:20:14 GMT
Server: Microsoft-IIS/10.0
Vary: *
X-Powered-By: ASP.NET
X-Powered-By: ARR/3.0
X-Powered-By: ASP.NET

So the response was gzip-compressed (Content-Encoding: gzip). In other words, I had to decompress the data before I could read it.

using System;
using System.IO;
using System.IO.Compression;
using System.Net;

public class Program
{
    static void Main(string[] args)
    {
        var url = new Uri("http://www.url.com/FILENAME.csv");
        var path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        var fileName = "File.csv";

        using (WebClient wc = new WebClient())
        using (Stream s = File.Create(Path.Combine(path, fileName)))
        using (GZipStream gs = new GZipStream(wc.OpenRead(url), CompressionMode.Decompress))
        {
            //Saves to C:\Users\[YourUser]\Desktop\File.csv
            gs.CopyTo(s);
        }
    }
}
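A possible alternative, shown only as a sketch: instead of wrapping the stream in a GZipStream by hand, a WebClient subclass can ask the framework to decompress the response through `HttpWebRequest.AutomaticDecompression`. The class name `DecompressingWebClient` and the helper `EnableDecompression` are names invented for this example.

```csharp
using System;
using System.Net;

// Hypothetical subclass (not from the original answer): lets the framework
// decompress gzip/deflate responses transparently.
public class DecompressingWebClient : WebClient
{
    // Turns on automatic decompression when the request is an HTTP request.
    public static void EnableDecompression(WebRequest request)
    {
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        EnableDecompression(request);
        return request;
    }
}
```

With this in place, `new DecompressingWebClient().DownloadString(url)` or `DownloadFile(url, path)` should return the decompressed content directly, so the explicit GZipStream is no longer needed.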
Joel Wiklund