I apologize for this silly question. I am maintaining old legacy VB6 code, and I have a function that actually works - but I simply can't figure out why it works, or why the code doesn't work without it.

Basically, this function reads a UTF-8 text file and displays its contents in a DHTMLEdit component. It reads the entire file into a string, converts that string from double-byte to multibyte using the ANSI code page, and then converts it back to double-byte.

Using this entire elaborate mechanism causes the component to correctly display a page that has Hebrew, Arabic, Thai and Chinese, all at the same time. Not using this code makes the text look like it was converted down to ASCII, showing various punctuation marks where letters once were.

What I don't understand is:

  1. Since the original file is UTF-8 and VB6 strings are UTF-16, why is this even needed? Why doesn't VB6 read the string correctly from the file without all these conversions?
  2. If the function converts from wide-character to multibyte using CodePage = 0 (ANSI), wouldn't that eliminate any characters that are not supported by the current code page? I don't even have Chinese, Thai and Arabic installed on this station. And yet this is the only way that I can get the DHTMLEdit control to display correctly.

[code]

Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal codePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal codePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpDefaultChar As Long, lpUsedDefaultChar As Long) As Long
Private Declare Function GetACP Lib "kernel32" () As Long


...
Open filePath For Input As #lFilePtr
Dim sInput As String
Dim sResult As String

Do While Not EOF(lFilePtr)
    Line Input #lFilePtr, sInput
    sResult = sResult & sInput
Loop
' CP_UTF8 is the Win32 code page identifier 65001
txtBody.DOM.Body.innerText = DecodeString(sResult, CP_UTF8)

Public Function DecodeString(ByVal strSource As String, Optional FromCodePage As Long = -1) As String
    Dim strTemp As String

    If strSource = vbNullString Then Exit Function
    strTemp = UnicodeToAnsi(strSource, 0)
    DecodeString = AnsiToUnicode(strTemp, FromCodePage)
End Function

Public Function AnsiToUnicode(ByVal strSource As String, Optional ByVal codePage As Long = -1, Optional lFlags As Long = 0) As String
    Dim strBuffer As String
    Dim cwch As Long
    Dim pwz As Long
    Dim pwzBuffer As Long

    If codePage = -1 Then codePage = GetACP()
    pwz = StrPtr(strSource)
    cwch = MultiByteToWideChar(codePage, lFlags, pwz, -1, 0&, 0&)
    strBuffer = String$(cwch + 1, vbNullChar)
    pwzBuffer = StrPtr(strBuffer)
    cwch = MultiByteToWideChar(codePage, lFlags, pwz, -1, pwzBuffer, Len(strBuffer))
    AnsiToUnicode = Left(strBuffer, cwch - 1)
End Function

Public Function UnicodeToAnsi(ByVal strSource As String, Optional ByVal codePage As Long = -1, Optional lFlags As Long = 0) As String
    Dim strBuffer As String
    Dim cwch As Long
    Dim pwz As Long
    Dim pwzBuffer As Long

    If codePage = -1 Then codePage = GetACP()
    pwz = StrPtr(strSource)
    cwch = WideCharToMultiByte(codePage, lFlags, pwz, -1, 0&, 0&, ByVal 0&, ByVal 0&)
    strBuffer = String$(cwch + 1, vbNullChar)
    pwzBuffer = StrPtr(strBuffer)
    cwch = WideCharToMultiByte(codePage, lFlags, pwz, -1, pwzBuffer, Len(strBuffer), ByVal 0&, ByVal 0&)
    UnicodeToAnsi = Left(strBuffer, cwch - 1)
End Function

[code]

Dmitry Pavliv
user884248

1 Answer


VB6/VBA performs an implicit two-way UTF-16/ASCII translation when reading and writing files with the built-in statements.

Line Input treats the file as ASCII (a series of bytes, each representing a character), using the current system code page for non-Unicode programs. The characters it reads are converted to UTF-16.

When you read a UTF-8 file in this way, what you get is an "invalid" string - you can't use it directly in the language (if you try you will see garbage), but it contains usable binary data.

Then the pointer to that usable binary data is passed to WideCharToMultiByte (in UnicodeToAnsi), which results in another "invalid" string being created - this time it contains "ASCII" data. Effectively this reverts the conversion VB does automatically with Line Input, and because the original file was in UTF-8, you now have an "invalid" string with UTF-8 data in it, although the conversion function thought it was converting to ASCII.

The pointer to that second invalid string is passed to MultiByteToWideChar (in AnsiToUnicode), which finally creates a valid string that can be used in VB.
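
The round trip described above can be reproduced on a small scale. The sketch below is an illustration only, not the original code: `DemoRoundTrip` is a hypothetical name, `StrConv` stands in for the implicit conversion that Line Input performs, and it assumes a single-byte ANSI code page such as Windows-1252.

```vb
' Hypothetical demo: paste into a standard module and run from the Immediate window.
Public Sub DemoRoundTrip()
    Dim utf8Bytes() As Byte
    Dim misread As String
    Dim restored() As Byte

    ' The UTF-8 encoding of the Hebrew letter alef (U+05D0) is the byte pair D7 90.
    ReDim utf8Bytes(0 To 1)
    utf8Bytes(0) = &HD7
    utf8Bytes(1) = &H90

    ' Stand-in for Line Input: each byte is treated as one ANSI character
    ' and widened to UTF-16, giving two unrelated characters.
    misread = StrConv(utf8Bytes, vbUnicode)

    ' Stand-in for UnicodeToAnsi(..., 0): narrowing with the same code page
    ' is expected to hand back the original bytes.
    restored = StrConv(misread, vbFromUnicode)

    Debug.Print Len(misread), Hex$(restored(0)), Hex$(restored(1))
End Sub
```

If the printed bytes match the input, the UTF-8 data has survived the detour through the ANSI code page and can be decoded with CP_UTF8. On a multi-byte code page that is not guaranteed, because some UTF-8 byte sequences are not valid ANSI input.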

The confusing part about this code is that strings are used to hold the "invalid" data. Logically, all of these should have been arrays of bytes. I would refactor the code to read bytes from the file in binary mode and pass the array to MultiByteToWideChar directly.
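
A minimal sketch of that refactoring, under a few assumptions: `ReadUtf8File` is a hypothetical name, the file fits in memory, and a leading UTF-8 byte order mark, if present, is not stripped. The Declare and Const lines belong in the declarations section of the module.

```vb
Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal codePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8 As Long = 65001

' Hypothetical helper: reads a whole UTF-8 file and returns it as a VB string.
Public Function ReadUtf8File(ByVal filePath As String) As String
    Dim fileNum As Integer
    Dim bytes() As Byte
    Dim byteCount As Long
    Dim charCount As Long

    ' Binary mode performs no implicit ANSI-to-UTF-16 conversion.
    fileNum = FreeFile
    Open filePath For Binary Access Read As #fileNum
    byteCount = LOF(fileNum)
    If byteCount > 0 Then
        ReDim bytes(0 To byteCount - 1)
        Get #fileNum, , bytes
    End If
    Close #fileNum
    If byteCount = 0 Then Exit Function

    ' First call: ask how many UTF-16 code units the decoded text needs.
    charCount = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bytes(0)), byteCount, 0, 0)
    If charCount = 0 Then Exit Function

    ' Second call: decode straight into the result string's buffer.
    ReadUtf8File = String$(charCount, vbNullChar)
    MultiByteToWideChar CP_UTF8, 0, VarPtr(bytes(0)), byteCount, StrPtr(ReadUtf8File), charCount
End Function
```

Because the byte count is passed explicitly instead of -1, no terminating null is involved, and the result does not depend on the system ANSI code page. Unlike the Line Input loop in the question, this keeps the line breaks in the text.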

GSerg
  • GSerg - you're a genius. I had a feeling it was some backscene conversion going on but it just didn't make sense to me. Reading a file as ASCII and converting it to UTF-16? Wow. You are absolutely right about refactoring, unfortunately this is part of a much larger class that reads and parses emails, and the Line Input command is at the core of this class. Re-writing everything to use byte arrays would be too much hassle and not very efficient, I would have to keep converting back and forth. Thank you so much for helping me out! I'm sure this information will be useful for others as well. – user884248 Jun 01 '14 at 13:04
  • @user884248 In that class do you really have the invalid strings stored as first-class data? I.e. in the code you show they are only used to create and store decoded string, and it is a valid string and should be stored as string. It does not appear you'd need to convert back and forth. – GSerg Jun 01 '14 at 13:12
  • unfortunately yes. The code reads a line and then starts checking its contents. Even if I didn't have to convert back and forth, it would mean re-writing this entire class, and I have no unit tests for it. It has worked so far, so I prefer to leave it as it is. – user884248 Jun 01 '14 at 13:28
  • +1 You've correctly said VB6 file encoding depends on the Windows code page. The encoding is often called "ANSI" in the Windows documentation - it's a misleading term but it's commonly used. It's probably better than ASCII. Minor quibble - you've said that every byte represents a character, but on some code pages there are multiple bytes for a character, e.g. Simplified Chinese. This leads me to note that the original code described by the OP has another problem - it might fail for some code pages. On some code pages, it's possible that a stream of UTF-8 bytes won't be a valid stream of "ANSI" bytes. – MarkJ Jun 02 '14 at 11:21