lua - string.byte for non ascii characters

Question

I want to convert characters to numerical codes, so I tried string.byte("å"). However, string.byte() seems to return 195 for these kinds of characters.

Is there any way to get the numerical code of non-ASCII characters such as:

à,á,â,ã,ä,å

I'm using pure Lua.

Its UTF-8 encoding is `195,165` (two bytes); you can obtain it with `print(string.byte("å",1,-1))` – Egor Skriptunoff, Jun 12 '14 at 17:43
possible duplicate of [What is Unicode, UTF-8, UTF-16?](http://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) – Deduplicator, Jun 12 '14 at 17:49
A Lua string is a counted sequence of bytes. What you put in those bytes, in this case, is between you and your code editor. – Tom Blodget, Jun 12 '14 at 22:14
Your question is a bit unclear. You have saved your script with UTF-8 encoding. [@YuHao](http://stackoverflow.com/a/24196142/2226988) shows how to retrieve the variable number of bytes for each character in a string. But do you actually want the code points for the characters? For å in Unicode, it would be [229](http://www.fileformat.info/info/unicode/char/e5/index.htm). – Tom Blodget, Jun 13 '14 at 01:04

score 4 · Accepted Answer · answered Jun 13 '14 at 00:48


Lua thinks a string is a sequence of bytes, but a single Unicode character may be encoded as multiple bytes in UTF-8.

Assuming the string is valid UTF-8, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (in Lua 5.1, use "[%z\1-\127\194-\244][\128-\191]*"), and then get its numerical codes:

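local str = "à,á,â,ã,ä,å"

-- Each match is one UTF-8 byte sequence: a lead byte plus any continuation bytes.
-- The \x escapes require Lua 5.2 or later.
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
    print(c:byte(1, -1))  -- print every byte of the sequence
end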

Output (with the script saved as UTF-8):

195	160
44
195	161
44
195	162
44
195	163
44
195	164
44
195	165

Note that 44 is the encoding for the comma.
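
If what you want is the Unicode code point rather than the raw bytes (229 for å, as mentioned in the comments on the question), the bytes of each sequence can be combined arithmetically. The following is only a minimal sketch under stated assumptions: the helper name utf8_codepoints is hypothetical, the input is assumed to be valid UTF-8 (nothing is validated), and the sample strings use decimal escapes so the result does not depend on how the script file is saved.

```lua
-- Hypothetical helper: decode a UTF-8 string into a list of code points.
-- Assumes valid UTF-8 input; malformed or truncated sequences are not handled.
local function utf8_codepoints(s)
    local points = {}
    local i = 1
    while i <= #s do
        local b = s:byte(i)
        local cp, extra
        if b < 0x80 then        -- 0xxxxxxx: ASCII, single byte
            cp, extra = b, 0
        elseif b < 0xE0 then    -- 110xxxxx: lead byte of a 2-byte sequence
            cp, extra = b - 0xC0, 1
        elseif b < 0xF0 then    -- 1110xxxx: lead byte of a 3-byte sequence
            cp, extra = b - 0xE0, 2
        else                    -- 11110xxx: lead byte of a 4-byte sequence
            cp, extra = b - 0xF0, 3
        end
        for j = 1, extra do
            -- each continuation byte (10xxxxxx) contributes six more bits
            cp = cp * 64 + (s:byte(i + j) - 0x80)
        end
        points[#points + 1] = cp
        i = i + extra + 1
    end
    return points
end

-- "à,å" written as explicit UTF-8 bytes
local sample = "\195\160,\195\165"
print(table.concat(utf8_codepoints(sample), " "))  --> 224 44 229
```

The sketch uses plain arithmetic instead of bit operations so that it runs on Lua 5.1 as well as later versions. On Lua 5.3 and later, the standard utf8 library provides utf8.codepoint(s, 1, -1), which is likely the better choice when it is available.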


Yu Hao


Lua defines a string as a counted sequence of bytes. The spec defines it so. If you want something else, you took the wrong data-type. – Deduplicator Jun 13 '14 at 01:15
@Deduplicator You mean: because native Lua string doesn't support Unicode (yet), then don't try to solve Unicode problems with Lua? Why not if the solution is so simple? – Yu Hao Jun 13 '14 at 01:43
That's not what I meant. I just said that the Lua documentation cannot be mistaken in what a string is, because it is the authority responsible for the definition (at least with regards to Lua). The fact that a byte-string is not restricted to valid UTF-8 (any UTF-8 string is a valid Lua string though), nor is just an interface to unicode codepoints or graphemes or grapheme-clusters does not change anything. Just change the first sentence, and it's ok. BTW: Changing the Lua string to be restricted to Unicode and enforcing Unicode semantics would make it useless in many contexts. – Deduplicator Jun 13 '14 at 14:35
