First of all, note that none of the functions in Lua's string library know anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply sequences of bytes. If you are using UTF-8 encoded strings, it's up to you to figure out which bytes make up a character. Therefore, string.len will give you the number of bytes, not the number of characters, and string.sub will give you a substring of bytes, not a substring of characters.
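To make the byte-versus-character difference concrete, here is a small sketch. The sample string is my own choice, and it is written with decimal byte escapes so that the result does not depend on how the source file is encoded:

```lua
-- "héllo": é is the two bytes 195 169 in UTF-8.
local s = "h\195\169llo"

print(string.len(s))        -- 6 bytes, although there are only 5 characters
print(#s)                   -- 6, the length operator counts bytes as well

-- string.sub counts bytes too, so this cuts é in half:
local broken = string.sub(s, 1, 2)
print(#broken)              -- 2: "h" plus the first byte of é
print(broken == "h\195")    -- true
```

The second byte of é is simply dropped, which leaves an invalid UTF-8 string.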
Some UTF-8 basics:
If you need some refreshing on the conceptual basics of Unicode, you should check out this article.
UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32 (always 4 bytes per character) and UTF-16 (2 or 4 bytes), it uses anywhere from 1 to 4 bytes to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244, which signals that more bytes follow to encode a full character. This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to come only from the range 128 to 191. Therefore, by looking at a single byte we know where it stands in a character:
- If it's in [0,127], it's a single-byte (ASCII) character
- If it's in [128,191], it's part of a longer character and meaningless on its own
- If it's in [194,244], it marks the beginning of a longer character (and tells us how long that character is)
This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.
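As a rough illustration of that claim, the byte ranges above can be turned into a character counter and a splitter. The function names (byteKind, countChars, splitChars) are hypothetical, and the sketch assumes the input is valid UTF-8:

```lua
-- Classify a single byte value (0-255) according to the ranges above.
local function byteKind(b)
    if b <= 127 then
        return "ascii"
    elseif b <= 191 then
        return "continuation"
    elseif b >= 194 and b <= 244 then
        return "lead"
    else
        return "invalid"   -- 192, 193 and 245-255 never occur in valid UTF-8
    end
end

-- Count characters: every character contributes exactly one byte that is
-- not a continuation byte.
local function countChars(s)
    local n = 0
    for i = 1, #s do
        local kind = byteKind(string.byte(s, i))
        if kind == "ascii" or kind == "lead" then
            n = n + 1
        end
    end
    return n
end

-- Split into characters: append continuation bytes to the current chunk,
-- start a new chunk at every other byte.
local function splitChars(s)
    local chars = {}
    for i = 1, #s do
        local b = string.byte(s, i)
        if byteKind(b) == "continuation" and #chars > 0 then
            chars[#chars] = chars[#chars] .. string.char(b)
        else
            chars[#chars + 1] = string.char(b)
        end
    end
    return chars
end

local sample = "h\195\169\226\130\172"   -- "hé€": 1 + 2 + 3 bytes
print(#sample)                  -- 6
print(countChars(sample))       -- 3
print(#splitChars(sample))      -- 3
```

This is deliberately byte-by-byte to mirror the explanation; the pattern-based approach described next does the same work far more compactly.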
Some pattern matching basics:
For the task at hand we need a few of Lua's pattern matching constructs:
- [...] is a character class that matches a single character (or rather byte) out of those inside the class. E.g. [abc] matches either a, b or c. You can define ranges using a hyphen. Therefore [\33-\127], for example, matches any single one of the bytes from 33 to 127. Note that \127 is an escape sequence you can use in any Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, "a" is the same as "\97". You can negate a character class by starting it with ^ (so that it matches any single byte that is not part of the class).
- * repeats the previous token 0 or more times (arbitrarily many times, as often as possible).
- $ is an anchor. If it's the last character of the pattern, the pattern will only match at the end of the string.
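Here is a quick sketch of each construct on its own, using plain ASCIInputs that I made up for illustration:

```lua
-- A character class matches one byte out of a set.
print(string.match("xbz", "[abc]"))        -- b

-- Decimal escapes name a byte by its value, in any Lua string.
print("a" == "\97")                        -- true

-- Ranges use a hyphen; the space (byte 32) is outside 33-127 and is skipped.
print(string.match("  hi", "[\33-\127]"))  -- h

-- A leading ^ negates the class.
print(string.match("abc!", "[^a-z]"))      -- !

-- * repeats the previous item as often as possible (possibly zero times).
print(string.match("aaab", "a*"))          -- aaa

-- $ at the end of the pattern anchors the match to the end of the string.
print(string.match("foo.bar", "[a-z]*$"))  -- bar
```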
Combining all of that...
...your problem reduces to a one-liner:
    local function lastChar(s)
        return string.match(s, "[^\128-\191][\128-\191]*$")
    end
This will match a byte that is not a UTF-8 continuation byte (i.e., either a single-byte character or a byte that marks the beginning of a longer character). Then it matches an arbitrary number of continuation bytes (this cannot go past the current character, due to the range chosen), followed by the end of the string ($). Therefore, this will give you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.
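A short usage sketch follows. The function is repeated so the snippet runs on its own, and the inputs are my own samples (not the ones from the question), written with byte escapes:

```lua
local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end

print(lastChar("abc") == "c")                          -- true
print(lastChar("caf\195\169") == "\195\169")           -- true: é (2 bytes)
print(lastChar("5 \226\130\172") == "\226\130\172")    -- true: € (3 bytes)
print(lastChar(""))                                    -- nil: nothing to match
```

Note that an empty string yields nil rather than "", so callers may want to handle that case.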
Equivalently, you can use gsub to remove that last character from your string:
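    function deleteLastCharacter(s)
        -- gsub also returns the number of substitutions; the parentheses
        -- discard it so that only the modified string is returned.
        return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
    end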
The match is the same, but instead of returning the matched substring, we replace it with "" (i.e. remove it) and return the modified string.
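One detail worth knowing when using gsub this way: it returns two values, the new string and the number of substitutions. The sketch below shows both, plus a variant that keeps only the string. The name dropLastChar is hypothetical and the sample input is my own:

```lua
local s = "caf\195\169"   -- "café"; é is the two bytes 195 169

-- gsub returns the new string and the number of substitutions made.
local stripped, count = string.gsub(s, "[^\128-\191][\128-\191]*$", "")
print(stripped, count)    -- caf, then 1 (tab-separated)

-- Parentheses around the call keep only the first value.
local function dropLastChar(str)   -- hypothetical name, for illustration
    return (string.gsub(str, "[^\128-\191][\128-\191]*$", ""))
end

print(dropLastChar(s))                 -- caf
print(select("#", dropLastChar(s)))    -- 1: a single return value
print(dropLastChar("") == "")          -- true: nothing to remove
```

The extra return value matters when the result is passed straight into another function or a table constructor, where both values would otherwise be forwarded.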