
I tried reading a web page into my program using VB.NET's HttpWebRequest. My problem is figuring out the encoding of the web page before I actually read it into a string. When I read it in, some characters appear as diamonds with question marks inside them. The web page itself looks fine in a browser; it's only when I read it in that some characters aren't decoded correctly.

The original code I used was:

Dim myWebRequest As HttpWebRequest = WebRequest.Create(ourUri)
Dim myWebResponse As HttpWebResponse = myWebRequest.GetResponse()

Then, to try to get the encoding, I used:

Dim contentType As String = myWebResponse.ContentType
Dim charset As String = myWebResponse.CharacterSet

But both 'contentType' and 'charset' end up with the value 'text/html'. What I want to know is if the webpage is encoded in 'utf-8' or some other character set, so I can later retrieve it like this:

Dim receiveStream As Stream = myWebResponse.GetResponseStream()
Dim encode As Encoding = System.Text.Encoding.GetEncoding(charset)
Using reader As StreamReader = New StreamReader(receiveStream, encode)
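
One way to guard the `GetEncoding(charset)` call above is sketched here. It is only an illustration under stated assumptions: `ResolveEncoding` and the module name `EncodingSketch` are hypothetical names that do not appear in the original code, and UTF-8 is an arbitrary choice of fallback. `Encoding.GetEncoding` throws an `ArgumentException` for a name it does not recognize (such as `text/html`), so the sketch catches that and falls back instead of failing.

```vbnet
Imports System
Imports System.Text

Module EncodingSketch
    ' Hypothetical helper: maps a charset name to an Encoding and falls back
    ' to UTF-8 (an assumed default) when the name is missing or not recognized.
    Public Function ResolveEncoding(ByVal charset As String) As Encoding
        If String.IsNullOrWhiteSpace(charset) Then
            Return Encoding.UTF8
        End If
        Try
            ' Strip surrounding whitespace and quotes, e.g. "utf-8" sent with quotes.
            Return Encoding.GetEncoding(charset.Trim().Trim(""""c))
        Catch ex As ArgumentException
            ' Unknown name such as "text/html": use the fallback.
            Return Encoding.UTF8
        End Try
    End Function

    Sub Main()
        Console.WriteLine(ResolveEncoding("utf-8").WebName)
        Console.WriteLine(ResolveEncoding("text/html").WebName)
    End Sub
End Module
```

With a helper like this, `New StreamReader(receiveStream, ResolveEncoding(charset))` would no longer throw on an unusable charset value, although the fallback may still be the wrong encoding for a given page. On .NET Core and later, legacy code pages such as Windows-1252 may additionally need `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` before they can be looked up by name.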

So to sum up, there seems to be no way to inquire what the encoding is, and then use the answer to read the webpage the right way.

Is that true? Any help is appreciated.

The entire code of the routine (asked for by a commenter) follows:

Public Shared Function DownloadFileUsingURL(ByVal URLstring As String, ByVal descFilePathAndName As String, ByRef errorMessage As String, ByRef hadRedirect As Boolean,
                                                ByRef newURL As String, ByRef isSecureConnection As Boolean, ByVal didSupplyfullfilename As Boolean, ByRef downloadedcontent As String,
                                                ByRef httpTohttps As Boolean, ByRef httpsTohttp As Boolean, ByRef BytesRead As Integer) As Boolean
    Dim ourUri As New Uri(URLstring)
    Dim csh As New ClassSyncHttp
    Dim expectSecureConnection As Boolean
    Dim domainchanged As Boolean

    newURL = ""
    errorMessage = ""
    hadRedirect = False
    httpTohttps = False
    httpsTohttp = False
    isSecureConnection = False
    If URLstring.ToLower.StartsWith("https:") Then
        ServicePointManager.Expect100Continue = True
        ServicePointManager.SecurityProtocol = SecurityProtocolType.SystemDefault
        expectSecureConnection = True
    Else
        ServicePointManager.SecurityProtocol = SecurityProtocolType.SystemDefault
        expectSecureConnection = False
    End If
    Try
        Dim myWebRequest As HttpWebRequest = WebRequest.Create(ourUri) ' I changed webrequest to httpwebrequest
        Dim cookieContainer As CookieContainer = New CookieContainer   ' needs httpwebrequest to work

        myWebRequest.CookieContainer = cookieContainer
        myWebRequest.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials
        myWebRequest.UserAgent = "BrowseNet"
        myWebRequest.Timeout = ClassGlobalVariables.downloadTimeoutMilliseconds
        myWebRequest.Credentials = CredentialCache.DefaultCredentials
        Dim myWebResponse As HttpWebResponse = myWebRequest.GetResponse()
        Dim contentType As String = myWebResponse.ContentType
        Dim charset As String = myWebResponse.CharacterSet

        If Not ourUri.Equals(myWebResponse.ResponseUri) Then
            newURL = myWebResponse.ResponseUri.ToString
            hadRedirect = True
            If newURL.ToLower.StartsWith("https") Then
                isSecureConnection = True
            End If
            compareURLs(URLstring, newURL, httpTohttps, domainchanged)
        End If

        Dim receiveStream As Stream = myWebResponse.GetResponseStream()

        If didSupplyfullfilename Then
            Using fs As FileStream = File.Create(descFilePathAndName)
                receiveStream.CopyTo(fs)
                BytesRead = fs.Length
            End Using
        Else
            Dim encode As Encoding = System.Text.Encoding.GetEncoding(charset)
            Using reader As StreamReader = New StreamReader(receiveStream, encode)
                '   receiveStream.Seek(0, SeekOrigin.Begin)
                downloadedcontent = reader.ReadToEnd()
                BytesRead = downloadedcontent.Length
            End Using
        End If
        myWebResponse.Close()

        If expectSecureConnection Then
            isSecureConnection = True
        Else
            isSecureConnection = False
        End If

        Return True
    Catch ex As WebException
        If expectSecureConnection Then
            ' Guessing here that the problem was a wrong assumption about the secure connection. (The problem could be elsewhere, of course.)
            isSecureConnection = False
            httpsTohttp = True
            ' WebException.Status holds the WebExceptionStatus value; HResult is an unrelated error code.
            If ex.Status = WebExceptionStatus.SecureChannelFailure Then
                ' not sure what to do
            End If
            'Else
            '    isSecureConnection = True
        End If
        errorMessage = ex.Message

        Return False
    End Try


End Function
Mark Springer
  • see https://stackoverflow.com/questions/43148464/how-do-browsers-determine-the-encoding-used – battlmonstr Feb 11 '22 at 22:56
  • Can you provide a link to the page that sets `CharacterSet = "text/html"`? -- Set `Option Strict On` -- You can find statistical analysis that quantifies the encoding used in HTML pages around the world. The vast majority use UTF-8, but there's still a number of pages that use `ISO 8859-1`, or `Windows 1252`, even `UTF-16LE` (sic); others use a local encoding (because, well, the local charset is rendered well on their machines). The `CharSet` is usually set correctly, based on what the Server says or the `<meta>` tag specifies. There might be edge cases. You should post your actual code. – Jimi Feb 12 '22 at 00:10
  • I just added the entire code to the bottom of my original post. – Mark Springer Feb 12 '22 at 11:46
  • You didn't provide a link to an HTML page that fails the Encoding (with your code) -- `You didn't set Option Strict On` -- Many things to fix there: `SecurityProtocolType.SystemDefault`: irrelevant (it's already that or it fails - Windows 7 etc.), `.Expect100Continue = True`: irrelevant (already `true` but used only in POST and PUT requests, this is a GET), `.CredentialCache.DefaultCredentials`: what?, unknown `User-Agent`: use, e.g., an old Firefox User-Agent string instead. Always set `[HttpWebRequest].AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate` – Jimi Feb 13 '22 at 14:59
  • The HttpWebResponse stream performs encoding auto-detection when reading the bytes it's been receiving, then it sets the `CharacterSet` to what it has determined is the actual encoding used. It can detect all known Encodings (from the `CodePage` references that the Windows System knows about - quite a lot) -- You should get the Stream as a byte array and then decode it using the Encoding detected. If, for some reason, it turns out the Encoding is not correct (an edge case, but possible), you have only complex heuristic techniques, including the Region of the source IP Address etc. – Jimi Feb 13 '22 at 15:10
  • Here is the webpage that failed: http://thoughts-everything.com/shelp/mythdiff.htm but it turns out that it does work - almost. I was wrong about the charset, it was ISO-... (I was confusing it with another variable). The page itself though, when downloaded with my code, produces some wrong characters. I have to look further into that. Anyway, thanks for your help. – Mark Springer Feb 13 '22 at 18:42
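
The suggestion in the comments (take the response as raw bytes, then decode them with whatever encoding the server or the page declares) could be sketched roughly as follows. This is a hedged illustration, not the method anyone in the thread actually posted: `ExtractCharset`, `DecodePage`, and the module name are hypothetical, the 2048-byte window and the UTF-8 fallback are assumptions, and pages encoded as UTF-16 are not handled. A `charset=` value in the `Content-Type` header is tried first, then one declared inside the page itself.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module MetaCharsetSketch
    ' Hypothetical helper: returns the value after "charset=" in a header value
    ' or an HTML fragment, or Nothing when no such declaration is present.
    Public Function ExtractCharset(ByVal text As String) As String
        If String.IsNullOrEmpty(text) Then Return Nothing
        Dim m As Match = Regex.Match(text, "charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    ' Decodes the page bytes. Candidates are tried in order: the Content-Type
    ' header, then a declaration near the start of the page, then UTF-8.
    Public Function DecodePage(ByVal pageBytes As Byte(), ByVal contentTypeHeader As String) As String
        ' ASCII is enough to find the declaration in ASCII-compatible encodings.
        Dim headLength As Integer = Math.Min(pageBytes.Length, 2048)
        Dim head As String = Encoding.ASCII.GetString(pageBytes, 0, headLength)

        For Each candidate As String In {ExtractCharset(contentTypeHeader), ExtractCharset(head)}
            If String.IsNullOrWhiteSpace(candidate) Then Continue For
            Try
                Return Encoding.GetEncoding(candidate).GetString(pageBytes)
            Catch ex As ArgumentException
                ' Unknown encoding name: try the next candidate.
            End Try
        Next
        Return Encoding.UTF8.GetString(pageBytes)
    End Function

    Sub Main()
        ' In-memory stand-in for a downloaded page, so no network is needed.
        Dim html As String = "<html><head><meta charset=""utf-8""></head><body>caf" & ChrW(&HE9) & "</body></html>"
        Dim pageBytes As Byte() = Encoding.UTF8.GetBytes(html)
        Console.WriteLine(DecodePage(pageBytes, "text/html") = html)
    End Sub
End Module
```

In the routine from the question, the bytes could come from copying `myWebResponse.GetResponseStream()` into a `MemoryStream` with `CopyTo` and calling `ToArray()`, with `myWebResponse.ContentType` passed as the header value. If the declared encoding is itself wrong, the decoded text will still contain wrong characters, and only heuristic detection remains.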
