
I want to convert characters to numerical codes, so I tried string.byte("å"). However, string.byte() returns 195 for this kind of character.

Is there any way to get the numerical code of non-ASCII characters such as:

à,á,â,ã,ä,å

I'm using pure Lua.

wiki
  • Its UTF-8 code is `195,165` (two bytes), it can be obtained by `print(string.byte("å",1,-1))` – Egor Skriptunoff Jun 12 '14 at 17:43
  • possible duplicate of [What is Unicode, UTF-8, UTF-16?](http://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) – Deduplicator Jun 12 '14 at 17:49
  • A Lua string is a counted sequence of bytes. What you put in those bytes, in this case, is between you and your code editor. – Tom Blodget Jun 12 '14 at 22:14
  • Your question is a bit unclear. You have saved your script with a UTF-8 encoding. [@YuHao](http://stackoverflow.com/a/24196142/2226988) shows how to retrieve the variable number of bytes for each character in a string. But, do you actually want the codepoints for the characters? For å in Unicode, it would be [229](http://www.fileformat.info/info/unicode/char/e5/index.htm). – Tom Blodget Jun 13 '14 at 01:04

1 Answer

Lua thinks a string is a sequence of bytes, but a Unicode character encoded in UTF-8 may take up multiple bytes.

Assuming the string is valid UTF-8, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (in Lua 5.1, use "[%z\1-\127\194-\244][\128-\191]*"), and then get its numerical codes:

local str = "à,á,â,ã,ä,å"

for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
    print(c:byte(1, -1))
end

Output:

195 160
44
195 161
44
195 162
44
195 163
44
195 164
44
195 165

Note that 44 is the encoding for the comma.
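
If what you want is the Unicode code point (229 for å) rather than the raw bytes, each matched byte sequence can be decoded by hand. The following is only a minimal sketch, not a validating decoder: it assumes the input is valid UTF-8 and that the script file itself is saved as UTF-8. The function names `codepoint` and `codepoints` are made up for this example, and the pattern "[^\128-\191][\128-\191]*" is a variant of the one above (any non-continuation byte followed by its continuation bytes), written with decimal escapes so that it also runs on Lua 5.1.

```lua
-- Decode one UTF-8 byte sequence (a string holding exactly one character)
-- into its Unicode code point. Assumes valid UTF-8; no error checking.
local function codepoint(ch)
    local b1 = ch:byte(1)
    if b1 < 0x80 then
        return b1                          -- single byte: plain ASCII
    end
    -- Strip the length marker from the lead byte to get its payload bits.
    local cp
    if b1 < 0xE0 then
        cp = b1 - 0xC0                     -- 2-byte sequence: 110xxxxx
    elseif b1 < 0xF0 then
        cp = b1 - 0xE0                     -- 3-byte sequence: 1110xxxx
    else
        cp = b1 - 0xF0                     -- 4-byte sequence: 11110xxx
    end
    -- Each continuation byte (10xxxxxx) contributes 6 more bits.
    for i = 2, #ch do
        cp = cp * 64 + (ch:byte(i) - 0x80)
    end
    return cp
end

-- Split a UTF-8 string into characters and return their code points.
local function codepoints(str)
    local result = {}
    for ch in str:gmatch("[^\128-\191][\128-\191]*") do
        result[#result + 1] = codepoint(ch)
    end
    return result
end

print(table.concat(codepoints("à,á,â,ã,ä,å"), " "))
-- prints: 224 44 225 44 226 44 227 44 228 44 229
```

On Lua 5.3 or later, the standard `utf8` library already covers this: `utf8.codepoint(s, 1, -1)` returns the code points of all characters in `s`, and `utf8.codes(s)` iterates over them.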

Yu Hao
  • Lua defines a string as a counted sequence of bytes. The spec defines it so. If you want something else, you took the wrong data-type. – Deduplicator Jun 13 '14 at 01:15
  • @Deduplicator You mean: because native Lua string doesn't support Unicode (yet), then don't try to solve Unicode problems with Lua? Why not if the solution is so simple? – Yu Hao Jun 13 '14 at 01:43
  • That's not what I meant. I just said that the Lua documentation cannot be mistaken in what a string is, because it is the authority responsible for the definition (at least with regards to Lua). The fact that a byte-string is not restricted to valid UTF-8 (any UTF-8 string is a valid Lua string though), nor is just an interface to unicode codepoints or graphemes or grapheme-clusters does not change anything. Just change the first sentence, and it's ok. BTW: Changing the Lua string to be restricted to Unicode and enforcing Unicode semantics would make it useless in many contexts. – Deduplicator Jun 13 '14 at 14:35