
I am trying to download the contents of a website. However, for one particular webpage the returned string is jumbled, containing many � characters.

Here is the code I was originally using.

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);

I also tried alternate implementations with WebClient, but got the same result:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}

From searching, I suspect this might be an encoding issue, but I have tried both of the solutions posted below and still cannot get this to work.

The offending page that I cannot seem to download is the United_States article on the English Wikipedia (http://en.wikipedia.org/wiki/United_States). I have tried a number of other Wikipedia articles and have not seen this issue with them.

Nick Collier

3 Answers


Using the built-in loader in HtmlAgilityPack worked for me:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here

Edit:

Using a standard WebClient with your user-agent string will result in an HTTP 403 (Forbidden) - using this instead worked for me:

using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}

Also see this SO thread: WebClient forbids opening wikipedia page?

BrokenGlass
  • I tried the first method you suggested and got the following error: 'gzip' is not a supported encoding name. Parameter name: name at System.Globalization.EncodingTable.internalGetCodePageFromName(String name) at System.Globalization.EncodingTable.GetCodePageFromName(String name) – Nick Collier Sep 22 '11 at 16:41
  • @Nick: Worked fine for me - make sure you have the latest version of HtmlAgilityPack - I got mine from NuGet – BrokenGlass Sep 22 '11 at 16:45
  • This is still failing with the same error after getting HtmlAgilityPack from NuGet. The version installed by NuGet is 1.4.0.0. – Nick Collier Sep 22 '11 at 17:01
  • @Nick - that's strange *both* definitely work for me here - not much I can help you further with since I can't reproduce the problem unfortunately – BrokenGlass Sep 22 '11 at 18:46

The response is gzip encoded. Try the following to decode the stream:

UPDATE

Based on the comment by BrokenGlass setting the following properties should solve your problem (worked for me):

req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
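As a language-neutral sketch of what is going wrong (Python used here purely for illustration, not part of the original answer): if a gzip-compressed response body is read as UTF-8 text without decompressing it first, the compressed bytes are not valid UTF-8 and become U+FFFD replacement characters, which is exactly the � jumble in the question.

```python
import gzip

# Simulate a gzip-encoded HTTP body, as Wikipedia serves it
# when the client advertises Accept-Encoding: gzip.
body = gzip.compress(b"<html><body>United_States article</body></html>")

# Reading the compressed bytes straight into a string -- what the
# StreamReader does when the response is never decompressed --
# produces replacement characters, since the gzip bytes are not
# valid UTF-8.
jumbled = body.decode("utf-8", errors="replace")
print("\ufffd" in jumbled)

# Decompressing first recovers the real markup.
print(gzip.decompress(body).decode("utf-8"))
```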

Old/Manual solution:

string source;
var response = req.GetResponse();

var stream = response.GetResponseStream();
try
{
    if (response.Headers.AllKeys.Contains("Content-Encoding")
        && response.Headers["Content-Encoding"].Contains("gzip"))
    {
        stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
    }

    using (StreamReader reader = new StreamReader(stream))
    {
        source = reader.ReadToEnd();
    }
}
finally
{
    if (stream != null)
        stream.Dispose();
}
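For comparison, the same check-the-header-then-decompress logic can be sketched in Python (again purely illustrative, with made-up response data, not from the original answer):

```python
import gzip
import io

def decode_body(headers, raw):
    """Mirror of the manual C# solution above: gunzip the body only
    when the Content-Encoding header says it is gzip-compressed."""
    if "gzip" in headers.get("Content-Encoding", ""):
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode("utf-8")

# Hypothetical response data for demonstration.
compressed = gzip.compress(b"<html>ok</html>")
print(decode_body({"Content-Encoding": "gzip"}, compressed))
print(decode_body({}, b"<html>plain</html>"))
```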
Peter
    You should never do this manually, this is built in already, i.e. see this answer: http://stackoverflow.com/questions/2973208/automatically-decompress-gzip-response-via-webclient-downloaddata – BrokenGlass Sep 22 '11 at 16:46
  • @BrokenGlass Thanks for the hint. I already wondered why I never had issues with gzip encoding before. – Peter Sep 22 '11 at 16:54

This is how I usually grab a page into a string (it's VB, but should translate easily):

Dim req As Net.HttpWebRequest = CType(Net.WebRequest.Create("http://www.cnn.com"), Net.HttpWebRequest)
Dim resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()

and I haven't had the problems you're seeing.

E.J. Brennan