I am trying to crawl some sites. It works like a charm, but there is one major problem: on some pages (not many) I get weird characters instead of HTML.

It looks like this:

;�<cS���u�/�qYa$�4l7�.�Q�7&��O����� Z�D}z��/���� ��u����V���lWY|�n5�1�We����GB�U��g{�� �|Ϸ����*�Q��0���nb�o�߯�����[b��/����@CƑ����D{{/n��X�!� �Et�X"����?��˩����8\y��&

If I open the page in my browser, there is no problem at all. I don't understand why.

My HTTP request headers are:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control: max-age=0
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36

I think it has something to do with the Accept header:

request.Accept = "*/*"

That's my web request:

Public Class Http
    Dim cookieCon As New CookieContainer
    Dim request As HttpWebRequest
    Dim response As HttpWebResponse

    Public Function GetRequest(ByVal Params() As Object)
        Dim url As String = Params(0)
        Dim mycookie As String = Params(1)
        'request.AllowAutoRedirect = True
        request = CType(HttpWebRequest.Create(url), HttpWebRequest)
        request.CookieContainer = New CookieContainer()
        request.Method = "GET"
        request.Timeout = 20000
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
        'request.ContentType = "application/x-www-form-urlencoded"
        request.Accept = "*/*"
        If Not mycookie Like "nocookie" Then
            request.Headers("Cookie") = mycookie
        End If
        response = CType(request.GetResponse(), HttpWebResponse)
        Dim html(1) As String
        html(0) = request.Address.ToString()
        html(1) = New StreamReader(response.GetResponseStream()).ReadToEnd()

        Return html
    End Function
End Class

Thanks.

Andrew Morton
Ponch0

1 Answer

The data you are downloading is GZip-compressed. Your browser decompresses it transparently, but HttpWebRequest only does so when told to, so you need to decompress it. Change your function to this:

Dim request As HttpWebRequest
Dim response As HttpWebResponse
Public Function GetRequest(ByVal Params() As Object) As String()
    Dim url As String = Params(0)
    Dim mycookie As String = Params(1)
    'request.AllowAutoRedirect = True
    request = CType(HttpWebRequest.Create(url), HttpWebRequest)
    request.CookieContainer = New CookieContainer()
    If Not mycookie Like "nocookie" Then
        request.Headers("Cookie") = mycookie
    End If
    ' Advertise GZip support and decompress the response body transparently.
    request.AutomaticDecompression = DecompressionMethods.GZip
    response = CType(request.GetResponse(), HttpWebResponse)

    Dim html(1) As String
    html(0) = request.Address.ToString()
    html(1) = New StreamReader(response.GetResponseStream).ReadToEnd()

    Return html
End Function

Usage:

Dim params(1) As Object
params(0) = url
params(1) = "nocookie" ' no Cookie header is sent for this value

Dim page As String = GetRequest(params)(1)
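
If you want to see what those "weird characters" are, here is a minimal sketch that needs no network access: it gzips a string in memory and then reads it back through a `GZipStream`, which is roughly what the manual alternative to `AutomaticDecompression` looks like. The names `GzipDemo`, `Compress` and `Decompress` are made up for this illustration and are not part of the code above.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module GzipDemo
    ' Hypothetical helper: gzip-compress a UTF-8 string into a byte array.
    Function Compress(ByVal text As String) As Byte()
        Dim raw As Byte() = Encoding.UTF8.GetBytes(text)
        Using output As New MemoryStream()
            Using gz As New GZipStream(output, CompressionMode.Compress)
                gz.Write(raw, 0, raw.Length)
            End Using
            ' ToArray still works after the GZipStream has closed the MemoryStream.
            Return output.ToArray()
        End Using
    End Function

    ' Hypothetical helper: decompress gzip bytes back into a UTF-8 string.
    Function Decompress(ByVal data As Byte()) As String
        Using source As New MemoryStream(data)
            Using gz As New GZipStream(source, CompressionMode.Decompress)
                Using reader As New StreamReader(gz, Encoding.UTF8)
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Using
    End Function

    Sub Main()
        Dim original As String = "<html><body>Hello</body></html>"
        Dim compressed As Byte() = Compress(original)

        ' Gzip data starts with the magic bytes &H1F, &H8B; reading these
        ' bytes as text is what produces the garbage shown in the question.
        Console.WriteLine(compressed(0) = &H1F AndAlso compressed(1) = &H8B)

        ' Decompressing restores the original markup.
        Console.WriteLine(Decompress(compressed) = original)
    End Sub
End Module
```

With a real response you would wrap `response.GetResponseStream()` in the `GZipStream` instead of a `MemoryStream`, ideally after checking that `response.ContentEncoding` says "gzip". If the server might also use deflate, `AutomaticDecompression` accepts `DecompressionMethods.GZip Or DecompressionMethods.Deflate`.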
Hanlet Escaño
  • Apparently that can be done automatically: [.NET: Is it possible to get HttpWebRequest to automatically decompress gzip'd responses?](http://stackoverflow.com/questions/2815721/net-is-it-possible-to-get-httpwebrequest-to-automatically-decompress-gzipd-re) – Andrew Morton Sep 11 '13 at 18:10
  • @AndrewMorton, yeah I suppose so, both ways solve his problem. – Hanlet Escaño Sep 11 '13 at 18:11
  • Awesome! Thanks a lot for your help. I really appreciate it! – Ponch0 Sep 11 '13 at 18:35