5

I need to convert the string HTML from a mix of Cyrillic and Latin symbols to UNICODE.

I tried the following:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")
    MsgBox HTML
    HTML = StrConv(HTML, vbUnicode)
    MsgBox HTML
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function

You can see what is before and after the StrConv. If you like to get the html in a file, you can use the following code:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")

    Dim path As String

    path = ThisWorkbook.path & "\html.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1

    HTML = StrConv(HTML, vbUnicode)

    path = ThisWorkbook.path & "\htmlUNICODE.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function

IDEAS?

Sandra Rossi
  • 11,934
  • 5
  • 22
  • 48
Trenera
  • 1,435
  • 7
  • 29
  • 44
  • It would help to explain what the code does (or doesn't do) in relation to what you expect/want it to do. i.e. *what is the actual problem here*? – Tim Williams May 22 '14 at 16:28

2 Answers2

13

VBA's support for Unicode is not all that great.

It is possible to handle Unicode strings, but you will not be able to see the actual characters with Debug.Print or MsgBox - they will appear as ? there.

You can set Control Panel > Region and Language > Administrative tab > "Current language for non-Unicode programs" to "Russian" switch to a different code page, which would allow you to see Cyrillic letters in VBA message boxes instead of question marks. But that's only a cosmetic change.


Your real problem is something else here.

The server (nfs.mobile.bg) sends the document as Content-Type: text/html. There is no information about character encoding. That means the receiver must figure out character encoding on its own.

A browser does that by looking at the response byte stream and making guesses. In your case, a helpful <meta http-equiv="Content-Type" content="text/html; charset=windows-1251"> tag is present in the HTML source. Therefore, the byte stream should be interpreted as Windows-1251, which happens to be the Cyrillic ANSI code page in Windows.

So, we do not even have Unicode here!

In the absence of any additional info, the responseText property of the XMLHTTP object defaults to us-ascii. The extended characters from the Cyrillic alphabet are not present in ASCII, so they will be converted to actual question marks and are lost. That's why you can't use responseText for anything.

However, the original bytes of the response are still available, in the responseBody property, which is an array of Byte.

In VBA you must do the same thing a browser would do. You must interpret the byte-stream as a certain character set. The ADODB.Stream object can do that for you, and it's pretty straight-forward, too:

' reference: "Microsoft XML, v6.0" (or any other version)
' reference: "Microsoft ActiveX Data Objects 6.1 library" (or any other version)
Option Explicit

Sub HTMLsearch()
    Dim url As String, html As String
    
    url = "http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1"
    html = GetHTML(url, "Windows-1251")
    
    ' Cyrillic characters are supported in Office, so they will appear correctly
    ActiveDocument.Range.InsertAfter html
End Sub

Function GetHTML(Url As String, Optional Charset As String = "UTF-8") As String
    Dim request As New MSXML2.XMLHTTP
    Dim converter As New ADODB.stream
    
    ' fetch page
    request.Open "GET", Url, False
    request.send
    
    ' write raw bytes to the stream
    converter.Open
    converter.Type = adTypeBinary
    converter.Write request.responseBody
    
    ' switch the stream to text mode and set charset
    converter.Position = 0
    converter.Type = adTypeText
    converter.Charset = Charset
    
    ' read text characters from the stream, close the stream
    GetHTML = converter.ReadText
    converter.Close
End Function

I've been using MS Word here and calling HTMLsearch() properly writes Cyrillic characters to the page. They still do appear as ? in a MsgBox for me, though, but now that's purely a display problem, caused by the fact that VBA-created UI cannot deal with Unicode.

user202729
  • 3,358
  • 3
  • 25
  • 36
Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • 2
    @ Tomalak : I don't believe I've seen a better answer and explanation. Thank you very much for the support! – Trenera May 23 '14 at 08:22
  • @Vihar The final bit of info: Internally, all VBA strings are Unicode (UTF-16, just like all strings in Windows). So the result of `converter.ReadText` actually *already is* a Unicode string. What `converter.Charset = "Windows-1251"` effectively does is informing the stream *which ANSI code page it needs to use* to convert the contained bytes to Unicode. – Tomalak May 23 '14 at 09:11
  • 1
    e.g. byte `0xC1` in the input → interpreted as character `Б` using code page 1251 → mapped to `U+0431` (CYRILLIC SMALL LETTER BE) in Unicode (UTF-16 representation in memory: 2 bytes `0x0431`) – Tomalak May 23 '14 at 09:19
  • Thats a great answer from Tomalak, and it seemed awesome, I got it to run, but it did not work in the end on my production data that contains 100s of languages. see my answer below that did work. – hamish Apr 22 '15 at 08:04
  • This solution can only work in Windows. Excel for Mac can't do it. It doesn't understand MSXML2.XMLHTTP, for one. – Br.Bill Jun 06 '18 at 00:25
  • @Br.Bill That's true, the whole solution is geared towards Windows. On Mac Office, alternative approaches of fetching data via HTTP must be used. Conditional compilation (`#If Mac Then` ... `#Else` ... `#End If`) could be used to integrate them into the same codebase. – Tomalak Jun 06 '18 at 06:23
  • 1
    @Br.Bill Here's a issue/workaround thread in a repository which looks quite useful in general: https://github.com/VBA-tools/VBA-Web/issues/65 – Tomalak Jun 06 '18 at 06:31
  • 1
    Speak about an educating answer. This was the direct method to read a Russian site - though the _charset_ was UTF-8, but that did you also explain. – Gustav Oct 08 '18 at 14:30
5

My production order data comes from many countries. this is the only vba function I could find that really works.

Private Const CP_UTF8 = 65001

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
   ByVal CodePage As Long, ByVal dwFlags As Long, _
   ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, _
   ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long


Public Function sUTF8ToUni(bySrc() As Byte) As String
   ' Converts a UTF-8 byte array to a Unicode string
   Dim lBytes As Long, lNC As Long, lRet As Long

   lBytes = UBound(bySrc) - LBound(bySrc) + 1
   lNC = lBytes
   sUTF8ToUni = String$(lNC, Chr(0))
   lRet = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bySrc(LBound(bySrc))), lBytes, StrPtr(sUTF8ToUni), lNC)
   sUTF8ToUni = Left$(sUTF8ToUni, lRet)
End Function

Example Usage:

Dim sHTML As String
Dim bHTML() As Byte
bHTML = GetHTML("http://yoururlhere/myorderdata.php")
sHTML = sUTF8ToUni(bHTML)
sHTML = Mid(sHTML, 2)  'strip off Byte Order Mark: EF BB BF
hamish
  • 1,141
  • 1
  • 12
  • 21
  • 1
    Can you explain what you mean by "doesn't work" and give an example where my function fails? – Tomalak Apr 22 '15 at 08:14
  • 1
    This solution can only work in Windows. Excel for Mac can't do it. It doesn't have the kernel32 library. – Br.Bill Jun 06 '18 at 00:25