5

Given an Elixir bitstring encoded in UTF-16LE:

<<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0, 0, 0>>

how can I get this converted into a readable Elixir String (it spells out "Devastator")? The closest I've gotten is transforming the above into a list of the Unicode codepoints (["0044", "0065", ...]) and trying to prepend the \u escape sequence to them, but Elixir throws an error since it's an invalid sequence. I'm out of ideas.

user701847
  • You've already [answered](http://stackoverflow.com/a/39601246/3102718) this question, haven't you? – Oleksandr Avoiants Sep 29 '16 at 14:59
  • That was a temporary hack, and for more complex situations e.g. parsing a string of an unknown length that's terminated by a null byte, it was insufficient. – user701847 Sep 29 '16 at 15:13

2 Answers

10

The simplest way is using functions from the :unicode module:

:unicode.characters_to_binary(utf16binary, {:utf16, :little})

For example:

<<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0, 0, 0>>
|> :unicode.characters_to_binary({:utf16, :little})
|> IO.puts
#=> Devastator

(Note that the input ends with a null character, so the converted result contains a trailing null byte. The shell will therefore display it as a raw binary rather than as a string, and depending on the OS, IO.puts may print some extra representation for the null byte.)
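If the trailing null is unwanted, or the string is null-terminated as mentioned in the comments on the question, one option is to convert first and then cut the result at the first null byte. The following is only a sketch: the module name `Utf16Example` and the function `decode_until_null/1` are hypothetical names chosen for illustration. It also handles the `{:error, ...}` and `{:incomplete, ...}` tuples that `:unicode.characters_to_binary/2` returns for malformed input.

```elixir
defmodule Utf16Example do
  # Hypothetical helper (not part of any library): decodes a UTF-16LE binary
  # to UTF-8 and keeps only the text before the first null character.
  def decode_until_null(utf16le) when is_binary(utf16le) do
    case :unicode.characters_to_binary(utf16le, {:utf16, :little}) do
      utf8 when is_binary(utf8) ->
        # U+0000 is a single zero byte in UTF-8, and no other UTF-8
        # sequence contains a zero byte, so splitting on <<0>> is safe.
        [text | _rest] = :binary.split(utf8, <<0>>)
        {:ok, text}

      {:error, _converted, _rest} ->
        {:error, :invalid_utf16}

      {:incomplete, _converted, _rest} ->
        {:error, :incomplete_utf16}
    end
  end
end

input = <<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0, 0, 0>>
IO.inspect(Utf16Example.decode_until_null(input))
#=> {:ok, "Devastator"}
```

Because the whole buffer is converted before it is split, any invalid data after the terminator would still produce an error tuple. If that matters, the null code unit would have to be located in the UTF-16 data before converting.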

michalmuskala
  • Ah, wow...I had actually looked around in the Erlang libraries, specifically `binary` to see if any of those methods would help me, but completely neglected to scroll down the page and see the Unicode one...thanks! – user701847 Sep 29 '16 at 15:07
  • This is nice! I didn't know `:unicode.characters_*` functions also accepted binaries. @user701847 you should probably accept this answer instead of mine. – Dogbert Sep 29 '16 at 15:24
1

You can make use of Elixir's pattern matching, specifically <<codepoint::utf16-little>>:

defmodule Convert do
  def utf16le_to_utf8(binary), do: utf16le_to_utf8(binary, "")

  defp utf16le_to_utf8(<<codepoint::utf16-little, rest::binary>>, acc) do
    utf16le_to_utf8(rest, <<acc::binary, codepoint::utf8>>)
  end
  defp utf16le_to_utf8("", acc), do: acc
end

<<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0, 0, 0>>
|> Convert.utf16le_to_utf8
|> IO.puts

<<192, 3, 114, 0, 178, 0>>
|> Convert.utf16le_to_utf8
|> IO.puts

Output:

Devastator
πr²
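One caveat with the recursive version above: if the input is not valid UTF-16LE (for example, it has an odd number of bytes), neither function clause matches and the call raises a `FunctionClauseError`. A possible variant, sketched here under the assumption that tagged tuples are preferable to an exception, adds a catch-all clause. The module name `SafeConvert` and the shape of the return values are illustrative choices, not part of the original answer.

```elixir
defmodule SafeConvert do
  # Returns {:ok, utf8} on success, or {:error, converted_so_far, undecodable_rest}
  # when the remaining bytes cannot be read as a UTF-16LE codepoint.
  def utf16le_to_utf8(binary) when is_binary(binary), do: decode(binary, "")

  # Take one UTF-16LE codepoint off the front and append it as UTF-8.
  defp decode(<<codepoint::utf16-little, rest::binary>>, acc) do
    decode(rest, <<acc::binary, codepoint::utf8>>)
  end

  # Nothing left to decode: conversion succeeded.
  defp decode("", acc), do: {:ok, acc}

  # Leftover bytes that do not form a valid codepoint.
  defp decode(rest, acc), do: {:error, acc, rest}
end

IO.inspect(SafeConvert.utf16le_to_utf8(<<192, 3, 114, 0, 178, 0>>))
#=> {:ok, "πr²"}

IO.inspect(SafeConvert.utf16le_to_utf8(<<68, 0, 255>>))
#=> {:error, "D", <<255>>}
```

The order of the clauses matters: the catch-all has to come last, otherwise it would shadow the empty-binary case.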
Dogbert
  • Ah, that's what I was missing, thank you! I had never taken `codepoint` and then matched it like `codepoint::utf8`; I basically didn't know what to do with the 2 bytes. To make yours even simpler, we can just do: `for << codepoint::utf16-little <- binary >>, into: "", do: <` – user701847 Sep 29 '16 at 15:10