There isn't really such a thing as a "Unicode string" in Lua. Lua strings are sequences of bytes that can contain anything, so what matters is knowing the encoding of the data in the string.
I use Lua with UTF-8 strings, which just works for all the operations I care about. I do not use any Unicode string library, though those are available for Lua (ICU4Lua, slnunicode, etc.).
Some notes about using UTF-8 strings in Lua:
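- String length (# operator) returns the string length in bytes, not characters or codepoints (each non-ASCII character is a sequence of two to four bytes).
- String splitting (e.g. string.sub) works on byte offsets, so the offsets you pass must not fall in the middle of a UTF-8 sequence.
- String matching (string.find, string.match) works fine with ASCII patterns. A non-ASCII character inside a character class or in front of a quantifier is treated as separate bytes, not as one character.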
- Substring searching (such as string.find in 'plain' mode) works with UTF-8 as the needle or the haystack.
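To make these notes concrete, here is a small sketch using the string "héllo". The sample string and variable names are my own, and I've written the é with decimal escapes so the result doesn't depend on how the source file is saved:

```lua
-- "héllo": the é is encoded as the two bytes 195 169 in UTF-8.
local s = "h\195\169llo"

-- # counts bytes, not codepoints: 5 codepoints, but 6 bytes.
local byte_len = #s

-- string.sub works on byte offsets, so a careless cut can land inside a sequence.
local broken = string.sub(s, 1, 2)   -- "h" plus the lead byte only (invalid UTF-8)
local intact = string.sub(s, 1, 3)   -- "h" plus the complete two-byte é

-- ASCII patterns match as usual around the non-ASCII bytes.
local tail = string.match(s, "l+o$")

-- Plain-mode find (fourth argument true) works with a UTF-8 needle;
-- the returned positions are byte offsets.
local first, last = string.find(s, "\195\169", 1, true)

-- A non-ASCII character in a set is just a set of bytes: "[é]" also matches
-- the first byte of "è" (195 168), which is almost never what you want.
local set_hit = string.find("\195\168", "[\195\169]")
```

Here byte_len is 6, the é is found at bytes 2 to 3, and set_hit is 1 even though the subject string does not contain an é at all.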
Counting codepoints in UTF-8 is quite straightforward, if slightly less efficient than in a fixed-width encoding. For example, in Lua:
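function utf8_length(str)
    -- Count every byte that is not a continuation byte (128-191); 192 and 193
    -- never occur in valid UTF-8, so each match is the first byte of a codepoint.
    return select(2, string.gsub(str, "[^\128-\193]", ""));
end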
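The same "is this a continuation byte?" test can also find where a codepoint starts, which is what you need to cut a string without splitting a sequence. A rough sketch, assuming valid UTF-8 input; utf8_offset and utf8_sub are hypothetical helper names of mine, not part of any library mentioned here:

```lua
-- Hypothetical helper: byte offset at which codepoint number n starts (1-based).
-- Assumes s is valid UTF-8. Returns #s + 1 when n is one past the last
-- codepoint, and nil when n is further out of range.
local function utf8_offset(s, n)
    local count = 0
    for i = 1, #s do
        local b = string.byte(s, i)
        -- Bytes 128..191 are continuation bytes; anything else starts a codepoint.
        if b < 128 or b > 191 then
            count = count + 1
            if count == n then return i end
        end
    end
    if n == count + 1 then return #s + 1 end
    return nil
end

-- Hypothetical helper: substring covering codepoints i..j inclusive,
-- or nil if the range does not fit inside the string.
local function utf8_sub(s, i, j)
    local from = utf8_offset(s, i)
    local to = utf8_offset(s, j + 1)
    if not from or not to then return nil end
    return string.sub(s, from, to - 1)
end
```

With s = "h\195\169llo" ("héllo"), utf8_sub(s, 1, 2) gives back the three bytes of "hé". Each call scans from the start of the string, so this is fine for occasional use but not for tight loops. Lua 5.3 and later also ship a small utf8 standard library (utf8.len, utf8.offset) that covers the same ground.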
If you need more than this, the Unicode libraries I mentioned give you APIs for everything, including conversion between encodings.
Personally, I prefer this straightforward approach to that of languages that force a certain flavour of Unicode on you (such as JavaScript) or try to be clever by having multiple encodings built into the language (such as Python). In my experience they only cause headaches and performance bottlenecks.
In any case, I think every developer should have a good basic understanding of how Unicode works and of the principal differences between encodings, so that they can make the best choice about how to handle Unicode in their application.
For example, if all the existing strings in your application are in a wide-char encoding, using Lua would be much less convenient, as you would have to convert every string going in and out of Lua. This is entirely possible, but if your app might be CPU-bound (as in a game), the conversions would count against it performance-wise.
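To give an idea of the per-string work such a conversion involves, here is a minimal sketch of UTF-16LE to UTF-8 written in plain Lua. It assumes the wide-char data is well-formed UTF-16LE (even length, correctly paired surrogates), and utf8_encode and utf16le_to_utf8 are hypothetical names of mine. In a real application you would more likely do this on the host side or with one of the libraries above:

```lua
-- Hypothetical helper: encode one codepoint (0..0x10FFFF) as UTF-8 bytes.
-- Uses only arithmetic, so it does not depend on a bit-operations library.
local function utf8_encode(cp)
    local floor, char = math.floor, string.char
    if cp < 0x80 then
        return char(cp)
    elseif cp < 0x800 then
        return char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        return char(0xE0 + floor(cp / 0x1000),
                    0x80 + floor(cp / 0x40) % 0x40,
                    0x80 + cp % 0x40)
    else
        return char(0xF0 + floor(cp / 0x40000),
                    0x80 + floor(cp / 0x1000) % 0x40,
                    0x80 + floor(cp / 0x40) % 0x40,
                    0x80 + cp % 0x40)
    end
end

-- Hypothetical helper: convert a UTF-16LE byte string to UTF-8.
-- Assumes well-formed input; no validation is attempted.
local function utf16le_to_utf8(s)
    local out = {}
    local i = 1
    while i < #s do
        local lo, hi = string.byte(s, i, i + 1)
        local unit = lo + hi * 0x100
        i = i + 2
        if unit >= 0xD800 and unit <= 0xDBFF then
            -- High surrogate: combine it with the low surrogate that follows.
            local lo2, hi2 = string.byte(s, i, i + 1)
            local low = lo2 + hi2 * 0x100
            i = i + 2
            unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
        end
        out[#out + 1] = utf8_encode(unit)
    end
    return table.concat(out)
end
```

Every string crossing the boundary pays for a loop like this plus a new allocation, which is the cost I mean above.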