(UPDATED: This question was updated once I better understood what was going on; I removed the noisy, incorrect parts.)

This piece of code belongs to a COM DLL. For context, it contains an ActiveX object that is created and used inside a classic ASP page.

Public Function getHttpResponse(url As String)
  Dim request As New WinHttpRequest
  
  On Error GoTo errGetHttpResponse
  request.Open "HEAD", url
  request.Send
  getHttpResponse = request.Status
  Exit Function
  
errGetHttpResponse:
  getHttpResponse = Err.Description
End Function

The localization and encoding of the Err.Description string seem to depend on the running environment. For error 80072EE5, I get the following Err.Description (retrieved from the GUI of an executable that loads the DLL):

  • Machine 1, Windows 2008 SP2 32-bit, French language: L'URL n'est pas valide
  • Machine 2, Windows 2012 R2 Standard 64-bit, French language: L’URL n’est pas valide, hexdump gives:

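Note that the apostrophe is not the same: Machine 1 returns the ASCII apostrophe (U+0027), while Machine 2 returns the right single quotation mark (U+2019).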

The ASP page that calls the ActiveX object and displays its output can be simplified to this:

Session.LCID=1036 ' French identifier (global.asa file)
Dim o : Set o = CreateObject("TheDLL.TheObject")
Response.Write o.getHttpResponse(anInvalidUrl)

While it renders correctly when run from Machine 1, from Machine 2 the apostrophe is not displayed (viewing the source in Firefox shows a square character labelled 00 92). The HTML generated on Machine 2 is reproduced below:

00000000  4c 92 55 52 4c 20 6e 92  65 73 74 20 70 61 73 20  |L.URL n.est pas |
00000010  76 61 6c 69 64 65                                 |valide|

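The apostrophe is encoded as the single byte 0x92, which is RIGHT SINGLE QUOTATION MARK (U+2019) in Windows-1252. In ISO-8859-1, 0x92 is not a printable character but a C1 control code, which would explain the placeholder square when the page is decoded as ISO-8859-1.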
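To check which code page produces the byte 0x92, here is a minimal Python sketch (Python is used only for illustration; it is not part of the VB6/ASP code in the question). The variable names are hypothetical. It encodes the Machine 2 message as Windows-1252 and then shows what happens when the same bytes are read as ISO-8859-1:

```python
# Machine 2 wording: U+2019 RIGHT SINGLE QUOTATION MARK instead of U+0027
message = "L\u2019URL n\u2019est pas valide"

# Windows-1252 maps U+2019 to the single byte 0x92
cp1252_bytes = message.encode("cp1252")
print(cp1252_bytes.hex(" "))
# prints: 4c 92 55 52 4c 20 6e 92 65 73 74 20 70 61 73 20 76 61 6c 69 64 65

# ISO-8859-1 has no mapping for U+2019, so a strict encode fails
try:
    message.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Reading the Windows-1252 bytes as ISO-8859-1 yields the control code U+0092
misdecoded = cp1252_bytes.decode("iso-8859-1")
print(repr(misdecoded[1]))  # prints: '\x92'
```

The printed bytes match the hexdump of the generated HTML, which suggests the page body is written in Windows-1252 and only the declared or assumed charset differs.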

How can this difference be explained? And, optionally, is there a way to make this output platform-independent?

(And yes, this is soon-to-die legacy code.)

Amessihel
  • Windows uses UTF-16, not UTF-8. `WinHttpRequest`, being a COM object, should work in BSTR/OLESTRING, which are exclusively UTF-16 with no Ansi option. So the first question would be regarding the "checke with NotePad++" and the meaning that you attach to it. And the second question would be, what is actually your question? You have described a situation, but you haven't asked a question. – GSerg Jul 02 '21 at 10:44
  • @GSerg indeed, I was too quick to assume the real encoding of those strings; playing with the encoding option of a browser led me to figure the encoding out. My question was the one stated in the title. I added another one at the end of the body. – Amessihel Jul 02 '21 at 11:47
  • It's not VB6 that generates the message, it's `WinHttpRequest`. Most likely it has localized resources, or taps into the standard Windows error codes and messages that are also localized. All that is implementation details that are subject to change at any moment. Your VB6 program should not depend on that. If you want to handle an error meaningfully, examine the error code, not the message. – GSerg Jul 02 '21 at 13:15
  • @GSerg, I didn't mean it was VB6 itself which generates the message. I was wondering about localization encoding handling. The _encoding_ was the real issue, not the message itself. The main thing I understand from your comment is that even the encoding may vary from a Windows version to another. Thanks. – Amessihel Jul 02 '21 at 18:23
  • No, the encoding does not vary. VB6 and COM use UTF-16 Unicode, so it's always that. `L'URL` can be encoded in UTF-16 just as well as `L’URL`. – GSerg Jul 02 '21 at 18:27
  • The problem is not with the strings themselves, but in how those strings are encoded for display in HTML. That happens separate from the code shown. Is there a reason why you are not encoding your HTML pages in UTF-8 instead of using Windows-xxxx or ISO-8859-x? – Remy Lebeau Jul 02 '21 at 19:22
  • @GSerg, aren't UTF-16 code points _at least_ of 2 bytes? – Amessihel Jul 02 '21 at 19:25
  • @Amessihel They are indeed. – GSerg Jul 02 '21 at 20:00
  • @GSerg, so I can tell both output strings aren't encoded with UTF-16. (I'm updating my question) – Amessihel Jul 02 '21 at 20:02
  • @Amessihel `Err.Description` is in UTF-16 simply because all instances of `String` in VB6 are. What happens to this UTF-16 data if you pass it to various IO mechanisms is [another story](https://stackoverflow.com/a/23980044/11683). – GSerg Jul 02 '21 at 20:25
  • @RemyLebeau, yes, the reason lies there: "this is a soon-to-die legacy code." :-) I said in the body it is not an issue related to ASP HTML generation but I was wrong. Question updated. – Amessihel Jul 02 '21 at 20:36
  • @GSerg, yes, I wasn't talking about the internal representation of `String` vars, but about what happens on output (thanks for the "another story" link). – Amessihel Jul 02 '21 at 20:41
  • @Amessihel What happens on the output is completely detached from what Windows gives you as the error message like Remy Lebeau has noted. VB6 gives you UTF-16, you then [specify](https://stackoverflow.com/a/15392223/11683) whatever encoding you want for your HTML output, and if you don't, you'll get [some default](https://learn.microsoft.com/en-us/previous-versions/iis/6.0-sdk/ms524628(v=vs.90)#remarks). – GSerg Jul 02 '21 at 20:44
  • @Amessihel To output the HTML as UTF-8, set `Response.CodePage = 65001` and `Response.CharSet = "UTF-8"`. – Remy Lebeau Jul 02 '21 at 20:49
  • @RemyLebeau, on my side I was trying `Response.charset ="windows-1252"` and it worked. If I understand correctly, 1) Response.charset is by default Windows-1252 on machine 1 and ISO-8859-1 on machine 2, and 2) the apostrophe of the error message was changed too? – Amessihel Jul 02 '21 at 20:55
  • @GSerg, thank you, I think I got it. By the way, the first two hexdumps were performed on text copy/pasted from a TextBox into the Cygwin console; I suppose strings are output in UTF-8 in GUI content on machine 2 (the apostrophe being encoded on three bytes)? – Amessihel Jul 02 '21 at 21:02
  • @Amessihel Why wouldn't you want to set `Session`/`Response` to output as UTF-8 on all machines and just be done with it? `Session.LCID` affects number/currency formatting, whereas `(Session|Response).CodePage` affects character encoding. `(Session|Response).Charset` merely tells the browser what encoding is being used. – Remy Lebeau Jul 02 '21 at 21:03
  • @RemyLebeau, we deal with a lot of content encoded with Windows-1252. – Amessihel Jul 02 '21 at 21:05
  • @GSerg, I'm adding a community answer. Is the reason I get a UTF-8 string when copy/pasting text from VB6 TextBoxes the same as for the IO mechanisms? – Amessihel Jul 02 '21 at 21:18
  • @Amessihel Yes. While VB6 `String`s are Unicode, VB6 controls [aren't](https://stackoverflow.com/a/14081257/11683). The textboxes are not unicode themselves, and store text on the clipboard in the machine's ANSI codepage too. Putting a Unicode `String` into a text box triggers the IO conversion. You would not get a UTF-8 out of that though, only the machine's current ANSI codepage (the "Language for the non-Unicode programs"). – GSerg Jul 02 '21 at 21:27
  • It couldn't be, UTF-8 support for non-Unicode programs is [in beta stage](https://superuser.com/a/1451686/52365) and is not available in Windows 2012, let alone Windows 2008. What is the complete path that this piece of text is going through before you examine that it is UTF-8? – GSerg Jul 02 '21 at 21:38
  • @GSerg, you're right, I saw with some tests I can't rely on encoding preservation while copy pasting into a Cygwin console, a conversion to UTF-8 is somewhat done systematically. Thanks! – Amessihel Jul 02 '21 at 21:46
  • @RemyLebeau, just to be clear: the whole site contains both static and dynamic content. Also I don't have the full scope control on it. That's why I need to keep the whole content encoded in Windows-1252 to preserve consistency. – Amessihel Jul 02 '21 at 22:06
  • The more I think about this question, the more I wonder if it should not be deleted, it won't be useful for further readers... It is not a question of good quality – Amessihel Jul 02 '21 at 22:26
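
The advice in the comments to examine the error code rather than the localized message can be illustrated with a short sketch, again in Python purely for illustration. The helper name `split_hresult` is hypothetical; the constants are the standard Windows values for the Win32 facility and for `ERROR_WINHTTP_INVALID_URL`. The sketch splits the HRESULT 80072EE5 from the question into its facility and code fields, which do not depend on the system language:

```python
def split_hresult(hresult):
    """Split a 32-bit HRESULT into (is_failure, facility, code)."""
    is_failure = bool(hresult & 0x80000000)   # severity bit
    facility = (hresult >> 16) & 0x7FF        # 11-bit facility field
    code = hresult & 0xFFFF                   # 16-bit error code
    return is_failure, facility, code

FACILITY_WIN32 = 7
ERROR_WINHTTP_INVALID_URL = 12005

is_failure, facility, code = split_hresult(0x80072EE5)
print(is_failure, facility, code)  # prints: True 7 12005

# The same numeric check holds whatever the language of Err.Description
is_invalid_url = facility == FACILITY_WIN32 and code == ERROR_WINHTTP_INVALID_URL
```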

1 Answer

After an extended discussion in the comment section with Remy Lebeau and GSerg, here are the key points for understanding the differences:

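  • Strings are handled in UTF-16 by VB6, but the bytes that end up in the output depend on the context (for an ASP page, on the code page and charset of the response)
  • The error messages can vary between implementations, which could explain why the apostrophe changed from L'URL to L’URL
  • The byte 0x92 is how Windows-1252 encodes ’; setting `Response.Charset = "windows-1252"` made the page render correctly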
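
As a rough illustration of these points, the following Python sketch (not VB6; the `dump` helper and variable names are hypothetical) shows that both wordings of the message are perfectly representable in UTF-16, and that the bytes only diverge once an output encoding is chosen:

```python
ascii_variant = "L'URL"        # Machine 1 wording, apostrophe U+0027
curly_variant = "L\u2019URL"   # Machine 2 wording, quotation mark U+2019

def dump(text, encoding):
    """Return the bytes of text in the given encoding as spaced hex."""
    return text.encode(encoding).hex(" ")

for label, text in (("U+0027", ascii_variant), ("U+2019", curly_variant)):
    for encoding in ("utf-16-le", "cp1252", "utf-8"):
        print(label, encoding, dump(text, encoding))
```

For the U+2019 variant this gives `19 20` for the quotation mark in UTF-16LE, the single byte `92` in Windows-1252, and the three bytes `e2 80 99` in UTF-8, whereas the U+0027 variant is the byte `27` in both Windows-1252 and UTF-8.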
Amessihel