2

I need to script my app (not a game), and I'm having trouble choosing a scripting language for it. Lua looks fine (actually, it is ideal for my task), but it has problems with Unicode strings, which I will be using. I also thought about Python, but I don't like its syntax, and its DLL is too big for me (about 2.5 MiB). Python and similar languages have too many functions, batteries and modules that I do not need (e.g. I/O functions): the script just needs to implement logic, and my app will do everything else. So I'd like to know whether there is a scripting language that satisfies these conditions:

  • Unicode strings
  • I can import C++ functions and then call them from the script
  • Can be embedded into the app (no DLLs) without any problems

Reinventing the wheel is not a good idea, so I don't want to develop my own language. Or is there a way to write Unicode strings in Lua source, like L"Unicode string" in C++?

Ivan
  • 2
    If you define UTF-8 encoded strings inside a Lua script, for example `"開発"`, and save the file as UTF-8, it will work. However, to actually do something with that string in Lua, you'd have to use a string library like [slnunicode](https://github.com/LuaDist/slnunicode). If you only need this as an interface with your application, maybe this is sufficient for your needs (you would have to encode and decode data where necessary, but that's a low price to pay if Lua is otherwise a good fit for you). – Niklas B. Apr 21 '12 at 19:19
  • I'm not familiar with Lua, but if you can import C++ functions, then why not import a Unicode library for C++, e.g. ICU: http://site.icu-project.org/ – Dunes Apr 21 '12 at 19:45
  • I need to pass code to LuaVM using luaL_loadbuffer, which accepts char* buffers. And if I pass a string with, for example, umlaut I need this not to be converted to another char. – Ivan Apr 21 '12 at 19:55

5 Answers

7

Lua strings are encoding-agnostic. So, yes, you can write unicode strings in Lua scripts. If you need pattern matching, then the standard Lua string library does not support unicode classes. But plain substring search works.
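As a minimal sketch of the plain substring search mentioned above: the example assumes the script file itself is saved as UTF-8, so the literals below are stored as UTF-8 bytes. The variable names are illustrative only.

```lua
-- Assumption: this source file is saved as UTF-8 (precomposed characters).
local haystack = "Grüße aus Köln"
local needle = "Köln"

-- The fourth argument `true` switches off pattern matching,
-- so string.find does a raw byte-wise substring search.
local first, last = string.find(haystack, needle, 1, true)

-- The results are byte offsets, not character positions:
-- "ü" and "ß" each occupy two bytes before the match.
print(first, last)
```

The match is found because the needle's bytes appear verbatim in the haystack; no Unicode awareness is needed for that.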

lhf
  • You mean that I can pass a Unicode string to Lua using luaL_loadbuffer? It seems that it accepts a char* buffer. – Ivan Apr 21 '12 at 19:51
  • Yes, you can, if your unicode string is stored in a byte char array. – lhf Apr 21 '12 at 20:25
  • @Ivan: Note: string literals can contain any series of bytes; that's how Lua works. But actual name literals *must* be ASCII in Lua 5.2. So you can use string literals in Lua code that have an umlaut or whatever, but you can't have variable names that use them. – Nicol Bolas Apr 21 '12 at 23:16
  • I do not need to have Unicode literals, just strings. So, how can I store a Unicode string in an 8-bit string? I know about \xxx, but it works only with numbers less than 256 – Ivan Apr 22 '12 at 10:53
  • 1
    @Ivan, there are no unicode escape sequences, if that is what you mean. But see http://lua-users.org/wiki/LuaUnicode. – lhf Apr 22 '12 at 12:27
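The point made in these comments, that a chunk is just a byte buffer and string literals inside it may hold any bytes, can be tried from the Lua side without touching the C API. This is only a rough analogue of what `luaL_loadbuffer` does with a `char*` buffer; the `load_chunk` alias is a hypothetical name introduced here to cover both Lua 5.1 and later versions.

```lua
-- `loadstring` exists in Lua 5.1; Lua 5.2+ merged it into `load`.
local load_chunk = loadstring or load

-- Chunk source whose string literal contains "ü" as raw UTF-8 bytes (195, 188).
-- The decimal escapes are resolved here, so the chunk text holds the raw bytes.
local source = 'return "M\195\188nchen"'

local chunk = assert(load_chunk(source))
local result = chunk()

-- The bytes come back unchanged: nothing is converted to another character.
print(result == "M\195\188nchen", #result)
```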
5

There isn't really such a thing as a "unicode string". Strings are a sequence of bytes that can contain anything. Knowing the encoding of the data in the string matters, though.

I use Lua with UTF-8 strings, which just works for all the operations I care about. I do not use any Unicode string library, though those are available for Lua (ICU4Lua, slnunicode, etc.).

Some notes about using UTF-8 strings in Lua:

  • String length (# operator) returns the string length in bytes, not characters or codepoints (non-ASCII characters may be sequences of multiple bytes).
  • String splitting (e.g. string.sub) must not split up UTF-8 sequences.
  • String matching (string.find, string.match) works fine with ASCII patterns.
  • Substring searching (such as string.find in 'plain' mode) works with UTF-8 as the needle or the haystack.
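The first two notes can be illustrated with a short sketch. The helper `utf8_safe_prefix` is a hypothetical name, not part of any library, and it assumes its input is valid UTF-8. The sample string is written with decimal escapes so the byte counts do not depend on how the file is saved.

```lua
-- "naïve café" written with explicit UTF-8 bytes for "ï" (195, 175) and "é" (195, 169).
local s = "na\195\175ve caf\195\169"

-- Continuation bytes of a UTF-8 sequence lie in the range 128..191.
local function is_continuation(byte)
  return byte ~= nil and byte >= 128 and byte <= 191
end

-- Hypothetical helper: return at most `max_bytes` bytes of `str`
-- without cutting through a multi-byte sequence.
local function utf8_safe_prefix(str, max_bytes)
  local cut = math.min(max_bytes, #str)
  -- If the byte right after the cut is a continuation byte,
  -- the cut falls inside a sequence, so move it back.
  while cut > 0 and is_continuation(string.byte(str, cut + 1)) do
    cut = cut - 1
  end
  return string.sub(str, 1, cut)
end

print(#s)                      -- length in bytes, larger than the 10 characters
print(utf8_safe_prefix(s, 3))  -- stops before "ï" instead of splitting it
```

A plain `string.sub(s, 1, 3)` would keep only the first byte of "ï" and produce invalid UTF-8, which is what the helper avoids.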

Counting codepoints in UTF-8 is quite straightforward, if slightly less efficient than other encodings. For example in Lua:

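```lua
-- Count code points by counting every byte that is not a continuation byte.
-- Continuation bytes are \128-\191; \192 and \193 never occur in valid UTF-8.
function utf8_length(str)
        return select(2, string.gsub(str, "[^\128-\193]", ""))
end
```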

If you need more than this kind of thing, the unicode libraries I mentioned give you APIs for everything, including conversion between encodings.

Personally, I prefer this straightforward approach to any of the languages that force a certain flavour of Unicode on you (such as JavaScript) or try to be clever by having multiple encodings built into the language (such as Python). In my experience they only cause headaches and performance bottlenecks.

In any case, I think every developer should have a good basic understanding of how Unicode works, and of the principal differences between the encodings, so that they can make the best choice about how to handle Unicode in their application.

For example if all your existing strings in your application are in a wide-char encoding, it would be much less convenient to use Lua as you would have to add a conversion to every string in and out of Lua. This is entirely possible, but if your app might be CPU-bound (as in a game) then it would be a negative point performance-wise.

MattJ
  • OK, I see. But how can I store Unicode chars in an 8-bit string? This is the only question I have. In C++ I can write L"unicŏde". In Lua there are only "non-unicode" strings. Do I have to use codes instead? Unicode is encoded in 2 bytes, so what should I do? I do not need to manipulate the Unicode string, only pass it to a C++ function, but the Unicode string should be clearly stated in the Lua code (e.g. a filename like C:\sŏme\some\file.txt) and the script should successfully pass this string to a C++ function, which will process the string itself. The string should be clearly stated in the script (not in a file) – Ivan Apr 22 '12 at 20:32
  • 1
    Unicode encoded in 2 bytes is called UTF-16 (16 bits). UTF-8 is a method for splitting a unicode character into a sequence of 8-bit bytes. This allows automatic compatibility with anything that works with the string within the limits explained in my answer above. – MattJ Apr 23 '12 at 00:17
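To make the UTF-8 splitting described above concrete, here is a sketch of how one code point becomes a sequence of 8-bit bytes. The function `utf8_encode` is a hypothetical helper written for this illustration; it uses plain arithmetic so it does not depend on a particular Lua version, and it does not validate its input (surrogates and out-of-range values are not rejected).

```lua
-- Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
local function utf8_encode(cp)
  if cp < 0x80 then
    -- 1 byte: plain ASCII
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xE0 + math.floor(cp / 0x1000),
      0x80 + math.floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xF0 + math.floor(cp / 0x40000),
      0x80 + math.floor(cp / 0x1000) % 0x40,
      0x80 + math.floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40)
  end
end

-- "ŏ" is U+014F; in UTF-8 it is the two bytes 197 and 143.
local path = "C:\\s" .. utf8_encode(0x14F) .. "me\\some\\file.txt"
print(path == "C:\\s\197\143me\\some\\file.txt")
```

In practice the same bytes can be written straight into the script, either as the literal character in a UTF-8 saved file or as the decimal escapes `\197\143`, since each `\ddd` escape produces exactly one byte. The C++ side then receives UTF-8 and would have to convert it if it expects wide characters.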
1

Have a look at JavaScript - the V8 engine is pretty powerful and JavaScript does not come with a big stdlib. Besides that, you can easily embed it and from what I know it handles unicode fine.

ThiefMaster
0

Have a look at Io.

It's unicode all the way down and embeddable. Also it seems to provide some C++ binding library.

Community
draegtun
0

Take a look at Jim Tcl. It's small, easily embeddable and extendable, supports UTF-8 strings, and it's pretty powerful.

Colin Macleod