I am trying to parse a string (returned by a web server) that contains plain text mixed with non-standard (as far as I can tell) Unicode IDs such as "\Ud83c" or "\U293c". I need to display this string, emojis intact, to the user in a DataGridView.
btw, I am blind so please excuse any formatting errors :(

Full example of what my code is parsing: "Castle: \Ud83d\Udc40Jerusal\U00e9m.Miles"

The code I wrote, which is failing miserably:

Public Function ParseUnicodeId(LNKText As String) As String
    Dim workingarray() As String
    Dim CurString As String
    Dim finalString As String
    finalString = ""
    ' split at \ char
    workingarray = Split(LNKText, chr(92))
    For Each CurString In workingarray
        If CurString <> "" Then
            ' remove leading U so number can be converted to hex
            CurString = Right(CurString, Len(CurString) - 1)
            ' attempt to cut off rightmost chars until the number can be converted to text, as there is nothing separating the end of the Unicode chars and the start of plain text
            Do While IsNumeric(CurString) = False
                If CurString = "" Then
                    Exit Do
                End If
                CurString = Left(CurString, Len(CurString) - 1)
            Loop
            If CurString.StartsWith("U", StringComparison.InvariantCultureIgnoreCase) Then
                CurString = CurString.Substring(1)
            End If
            ' convert result from above to hex
            Dim numeric = Int32.Parse(CurString, NumberStyles.HexNumber)
            ' convert to bytes
            Dim bytes = BitConverter.GetBytes(numeric)
            ' convert resulting bytes to a real char for display
            finalString = finalString & Encoding.Unicode.GetString(bytes)
        End If
    Next
    ParseUnicodeId = finalString
End Function

I have tried this all kinds of ways but can't seem to get it right. My code currently returns empty strings, although my guess is that this is because of some of the more recent changes I made to cut off the leading U or to chop off one character at a time. If I take those bits out and just pass it something like "Ud83c", it works perfectly; it's only when plain text is mixed in that it fails, and I can't come up with a way to separate the two and recombine them at the end.

djv

1 Answer

You can use Regex.Unescape() to convert Unicode escape sequences (\uXXXX) to the characters they represent.
If you receive \U instead of \u, replace it with \u first, since \U is not recognized as a valid escape sequence.

Imports System.Text.RegularExpressions   ' at the top of the file
Dim input As String = "Castle: \Ud83d\Udc40Jerusal\U00e9m.Miles"
Dim result As String = Regex.Unescape(input.Replace("\U", "\u"))

This prints (it may depend on the Font used):

Castle: 👀Jerusalém.Miles

As a note, you might also have used the wrong encoding when you decoded the input stream.
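
Regex.Unescape() also interprets every other backslash sequence in the text (for example \n or \t), so it is only safe when the input contains nothing but these Unicode escapes. If that is not guaranteed, a narrower alternative is to replace only the \uXXXX / \UXXXX sequences and leave everything else untouched. The following is a minimal sketch of that idea, not part of the original answer; the module name UnicodeEscapeDemo and the helper DecodeUnicodeEscapes are hypothetical, and it assumes every escape is exactly four hex digits, as in the sample string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UnicodeEscapeDemo
    ' Hypothetical helper: decodes only \uXXXX / \UXXXX (exactly four hex digits)
    ' and leaves any other backslash in the text as it is.
    Function DecodeUnicodeEscapes(text As String) As String
        Return Regex.Replace(text, "\\[uU]([0-9A-Fa-f]{4})",
            Function(m As Match) Convert.ToChar(Convert.ToInt32(m.Groups(1).Value, 16)).ToString())
    End Function

    Sub Main()
        ' The surrogate pair \Ud83d\Udc40 becomes two adjacent UTF-16 code units,
        ' which together form a single emoji in the resulting .NET string.
        Dim decoded As String = DecodeUnicodeEscapes("Castle: \Ud83d\Udc40Jerusal\U00e9m.Miles")
        Console.WriteLine(decoded)
    End Sub
End Module
```

Because each match is converted on its own, plain text between the escapes never has to be split off and recombined by hand, which is the part the code in the question was struggling with.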

Jimi
  • `Unescape` requires the input to be properly escaped to begin with. Converting "\U" to "\u" is only one sanitizing step. – TnTinMn Apr 22 '19 at 17:20
  • Absolutely perfect. Can't believe my days of googling didn't find this, but I guess it's kind of an obscure thing to search for. Either way, many thanks – Nem Novakovic Apr 22 '19 at 17:25
  • @TnTinMn I think it should be decoded properly to begin with. Anyway, if these *strings* come from a known source, escaping directly is enough. If the source is unknown, the answers you linked don't even begin to scratch the surface of what's actually required to validate those streams. – Jimi Apr 22 '19 at 17:34