3

I have a very complex problem that is difficult to explain properly. There is LOTS of discussion about this across the internet, but nothing definitive. Any help, or better explanation than mine, is greatly appreciated.

Essentially, I'm just trying to write an XML file using UTF-16 with VBA.

If I do this:

sXML = "<?xml version='1.0' encoding='utf-8'?>"
sXML = sXML & rest_of_xml_document
Print #iFile, sXML

then I get a file that is valid XML. However, if I change the "encoding=" to "utf-16", I get this error from my XML validator:

Switch from current encoding to specified encoding not supported.

Googling tells me that this means the xml encoding attribute is different to the ACTUAL encoding used by the file, hence I must be creating a utf-8 document via Open and Print commands.

If I do something like:

With CreateObject("ADODB.Stream")
  .Type = 2
  .Charset = "utf-16"
  .Open
  .WriteText sXML
  .SaveToFile sFilename, 2
  .Close
End With

then I end up with some funky characters (the BOM) at the beginning of my file which causes it to fail XML validation.

If I open the file in Notepad++, delete the BOM and change the Encoding to "UCS-2", then the file validates fine with a "utf-16" encoding value (meaning that UCS-2 is close enough to UTF-16 that it doesnt matter, or that XML is able to Switch from current encoding between these two types.

I need to use UTF-16 because UTF-8 doesn't cover all the characters used in the presentations I'm exporting.

The question:

How can I get VBA to behave like Notepad++, creating a UTF-16-encoded text file without a BOM that can be filled with XML data? ANY help much appreciated!

Alex McMillan
  • 17,096
  • 12
  • 55
  • 88

1 Answers1

5

Your point about UTF-8 not being able to store all characters you need is invalid.
UTF-8 is able to store every character defined in the Unicode standard.
The only difference is that, for text in certain languages, UTF-8 can take more space to store its codepoints than, say, UTF-16. The opposite is also true: for certain other languages, such as English, using UTF-8 saves space.

VB6 and VBA, although store strings in memory in Unicode, implicitly switch to ANSI (using the current system code page) when doing file IO. The resulting file you get is NOT in UTF-8. It is in your current system codepage, which, as you can discover in this helpful article, looks just like UTF-8 if you're from USA.

Try:

Dim s As String
s = "<?xml version='1.0' encoding='utf-16'?>"
s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)

Dim b() As Byte
b = s

Open "Unicode.txt" For Binary Access Write As #1
Put #1, , b
Close #1

And if you absolutely must have UTF-8, you can make yourself some:

Option Explicit

Private Declare Function WideCharToMultiByte Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByRef lpMultiByteStr As Byte, ByVal cchMultiByte As Long, ByVal lpDefaultChar As String, ByRef lpUsedDefaultChar As Long) As Long

Private Const CP_UTF8 As Long = 65001
Private Const ERROR_INSUFFICIENT_BUFFER As Long = 122&


Public Function ToUTF8(s As String) As Byte()

  If Len(s) = 0 Then Exit Function


  Dim ccb As Long
  ccb = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), ByVal 0&, 0, vbNullString, ByVal 0&)

  If ccb = 0 Then
    Err.Raise 5, , "Internal error."
  End If

  Dim b() As Byte
  ReDim b(1 To ccb)

  If WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), b(LBound(b)), ccb, vbNullString, ByVal 0&) = 0 Then
    Err.Raise 5, , "Internal error."
  Else
    ToUTF8 = b
  End If

End Function
Sub Test()
  Dim s As String
  s = "<?xml version='1.0' encoding='utf-8'?>"
  s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)

  Dim b() As Byte
  b = ToUTF8(s)

  Open "utf-8.txt" For Binary Access Write As #1
  Put #1, , b
  Close #1
End Sub
GSerg
  • 76,472
  • 17
  • 159
  • 346
  • If this is the case, then why is it that when I use utf-8 the XML fails to validate because of illegal characters? The same XML validates if using utf-16... – Alex McMillan Feb 02 '12 at 01:02
  • @AlexMcMillan How do you use utf-8? By changing 'utf-16' to 'utf-8' in the code snippet above? – GSerg Feb 02 '12 at 01:07
  • Yeah - and I know that's not changing the actual encoding of the file, but it changes the way the validator looks at characters. For example, I'm finding the character ” in some text I'm putting into an XML element. If I set the encoding to "utf-8", it fails. If I set it to "utf-16", it works fine. – Alex McMillan Feb 02 '12 at 01:28
  • @AlexMcMillan It's not about validator. This code writes out UTF-16, that is, at least two bytes per character. If you then say it's UTF-8, then the other byte of each character gets interpreted as a separate character. Because you're using English, that other byte is usually zero, which yields the null character, which isn't allowed in XML. – GSerg Feb 02 '12 at 08:18