First question. What's the easiest way in Lua to determine whether the last character in a string is multibyte? Or, what's the easiest way to delete the last character from a string?

Here are examples of valid strings, and what I want the output of the function to be:

hello there     --- result should be:   hello ther
anñ             --- result should be:   an
כראע            --- result should be:   כרא
ㅎㄹㅇㅇㅅ       --- result should be:   ㅎㄹㅇㅇ

I need something like

function lastCharacter(string)
    --- some code which will extract the last character only ---
    return lastChar
end

or if it's easier

function deleteLastCharacter(string)
--- some code which will output the string minus the last character --- 
    return newString
end

This is the path I was going on

local function lastChar(string)
    local stringLength = string.len(string)
    local lastc = string.sub(string,stringLength,stringLength)
    if lastc is a multibyte character then
        local wordTable = {}
        for word in string:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
            wordTable[#wordTable+1] = word
        end
        lastc = wordTable[#wordTable]
    end
    return lastc
end
fun_programming
  • Try using the regular expression `^(.*).$`, then return the first capturing group. I'm not really sure how to do that in Lua, but I'm guessing this'll do. – FrankieTheKneeMan Apr 12 '13 at 20:25
  • I'm sorry: Use the expression `^(.*)(.)$`, then return the first capturing group for delete last character, or the second group to retrieve the last letter. – FrankieTheKneeMan Apr 12 '13 at 20:30
  • Your pattern seems quite good. Try removing the `+` and adding a `$` at the end. Removing the `+` will make sure that you don't pick up additional single-byte characters, and the `$` anchors your pattern to the end of the string. However, `string.len` will give you the number of bytes, hence `lastc` will contain only the last byte, not the entire last character. – Martin Ender Apr 12 '13 at 20:45
  • `string.sub(str, stringLength,stringLength)` does indeed return the last character in `str`. Just be sure not to name your variable `string`, as that conflicts with the `string` table. Also, could you elaborate what you mean by multibyte character? – Netfangled Apr 12 '13 at 20:50
  • @Netfangled are you sure? because the most recent edition of Programming in Lua claims the opposite. – Martin Ender Apr 12 '13 at 20:52
  • @m.buettner I just tested it, actually. Not sure if they changed the behaviour in 5.2, but it does work in 5.1. Try it out. – Netfangled Apr 12 '13 at 20:55
  • @Netfangled yeah no, doesn't work. It returns the last byte. For the third and fourth example, this is not the entire last character (neither for the second one, if it is properly UTF-8 encoded). It really can't work, since Lua's built-in `string` library has no concept of Unicode... its strings simply contain bytes and it's up to you to make sense of them. – Martin Ender Apr 12 '13 at 20:59
  • @m.buettner Well, I never... Learn something new everyday. – Netfangled Apr 12 '13 at 21:03

3 Answers

First of all, note that there are no functions in Lua's string library that know anything about Unicode/multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply made up of bytes. It's up to you to figure out which bytes make up a character, if you are using UTF-8 encoded strings. Therefore, string.len will give you the number of bytes, not the number of characters. And string.sub will give you a substring of bytes, not a substring of characters.
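
A small sketch of what that means in practice. The string is written with decimal byte escapes so the bytes are unambiguous; it assumes UTF-8, where ñ is the two bytes 195 177.

```lua
-- "anñ" written as raw UTF-8 bytes: 'a', 'n', then ñ = 195 177
local s = "an\195\177"

-- string.len (and the # operator) count bytes, not characters
print(#s)                      -- 4, although there are only 3 characters

-- string.sub slices bytes, so this is only the second half of ñ
local lastByte = string.sub(s, -1)
print(string.byte(lastByte))   -- 177, a lone continuation byte
```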

Some UTF-8 basics:

If you need some refreshing on the conceptual basics of Unicode, you should check out this article.

UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike the fixed-width UTF-32, it uses a variable number of bytes (from 1 to 4) to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244 (which signals that more bytes follow to encode a full character). This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to be taken only from the range from 128 to 191. Therefore, by looking at a single byte we know where it stands in a character:

  • If it's in [0,127], it's a single-byte (ASCII) character
  • If it's in [128,191], it's part of a longer character and meaningless on its own
  • If it's in [194,244], it marks the beginning of a longer character (and tells us how long that character is)

This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.
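
As a rough sketch of how the three ranges can be put to work, assuming the input is valid UTF-8. The helper names `byteKind`, `utf8Length` and `utf8Chars` are made up for this illustration and are not part of Lua's standard library.

```lua
-- Classify a single byte value (0-255) by its role in UTF-8.
local function byteKind(b)
    if b <= 127 then return "ascii"
    elseif b <= 191 then return "continuation"
    elseif b >= 194 and b <= 244 then return "lead"
    else return "invalid" end  -- 192, 193 and 245-255 never occur in valid UTF-8
end

-- Count characters by counting every byte that is not a continuation byte.
local function utf8Length(s)
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

-- Split a string into a table of characters (each a string of 1 to 4 bytes).
local function utf8Chars(s)
    local chars = {}
    for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

local s = "an\195\177"   -- "anñ" as raw bytes
print(utf8Length(s))     -- 3
print(#utf8Chars(s))     -- 3
```

Counting lead and ASCII bytes rather than decoding each character keeps the helpers short, at the cost of not detecting malformed input.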

Some pattern matching basics:

For the task at hand we need a few of Lua's pattern matching constructs:

[...] is a character class, that matches a single character (or rather byte) of those inside the class. E.g. [abc] matches either a, or b or c. You can define ranges using a hyphen. Therefore [\33-\127] for example, matches any single one of the bytes from 33 to 127. Note that \127 is an escape sequence you can use in any Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, "a" is the same as "\97".
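
A few one-liners that illustrate these points. The sample strings are arbitrary.

```lua
-- A decimal escape names a byte by value; "\97" and "a" are the same string.
print("a" == "\97")                              -- true

-- [abc] matches exactly one byte out of a, b or c.
local hit = string.match("xyzb", "[abc]")
print(hit)                                       -- b

-- A range inside a class: any one byte from 128 to 191.
local s = "n\195\177"                            -- "nñ" as raw bytes
local cont = string.match(s, "[\128-\191]")
print(string.byte(cont))                         -- 177
```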

You can negate a character class by starting it with ^, so that it matches any single byte that is not part of the class.

* repeats the previous token 0 or more times (arbitrarily many times - as often as possible).

$ is an anchor. If it's the last character of the pattern, the pattern will only match at the end of the string.
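
A short sketch of the negated class, the `*` quantifier and the `$` anchor in isolation. The sample strings are arbitrary.

```lua
local s = "ab\195\177"   -- "abñ" as raw bytes

-- Negated class: the first byte that is NOT a continuation byte.
local first = string.match(s, "[^\128-\191]")
print(first)                               -- a

-- * is greedy: it takes as many repetitions as it can...
local run = string.match("aaab", "a*")
print(run)                                 -- aaa

-- ...and zero repetitions still counts as a match (the empty string).
local empty = string.match("b", "a*")
print(#empty)                              -- 0

-- $ forces the match to sit at the end of the string.
local from, to = string.find("abcabc", "bc$")
print(from, to)                            -- 5 6
```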

Combining all of that...

...your problem reduces to a one-liner:

local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end

This will match a byte that is not a UTF-8 continuation byte (i.e., one that is either a single-byte character, or a byte that marks the beginning of a longer character). Then it matches an arbitrary number of continuation bytes (this cannot go past the current character, due to the range chosen), followed by the end of the string ($). Therefore, this will give you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.
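
A quick check of the pattern on its own, with the non-ASCII samples written as byte escapes. The byte values assume UTF-8; the Hangul sample is shortened to two characters.

```lua
local pattern = "[^\128-\191][\128-\191]*$"

local ascii  = "hello there"
local latin  = "an\195\177"                  -- "anñ"
local hangul = "\227\133\142\227\133\133"    -- "ㅎㅅ"

print(string.match(ascii, pattern))          -- e
print(#string.match(latin, pattern))         -- 2 (both bytes of ñ)
print(#string.match(hangul, pattern))        -- 3 (all three bytes of ㅅ)

-- An empty string has no last character, so there is no match.
print(string.match("", pattern))             -- nil
```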

Equivalently, you can use gsub to remove that last character from your string:

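function deleteLastCharacter(s)
    -- gsub returns the new string and a replacement count;
    -- the extra parentheses keep only the string
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end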

The match is the same, but instead of returning the matched substring, we replace it with "" (i.e. remove it) and return the modified string.
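
One detail worth knowing when returning the result of `string.gsub` directly: it yields two values, the new string and the number of substitutions made. A sketch of how an extra pair of parentheses truncates that to the string alone; the name `dropLast` is made up for this illustration.

```lua
local pattern = "[^\128-\191][\128-\191]*$"

-- Parentheses around the call keep only the first return value.
local function dropLast(s)
    return (string.gsub(s, pattern, ""))
end

-- Called directly, gsub hands back the string and a count.
print(string.gsub("an\195\177", pattern, ""))   -- an    1

-- Wrapped, only the string comes back.
print(dropLast("an\195\177"))                   -- an
print(select("#", dropLast("abc")))             -- 1
```

The difference matters whenever the result is passed straight into another call such as `print` or `table.insert`, where a stray second value would be picked up as an extra argument.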

Martin Ender
  • Thanks a lot. I wish I understood what you did but will keep reading until I get it. The solution works for all cases except for Hebrew where the characters go in other direction. – fun_programming Apr 12 '13 at 21:09
  • @learningphp my solution is not so different from your pattern. if there is a specific thing you don't get, feel free to ask... maybe I can elaborate a bit more on that then. – Martin Ender Apr 12 '13 at 21:52
  • There's too much to ask :). I don't know what [^\128-\191][\128-\191]*$","") means at all. I don't know what continuation characters are. I don't understand the basics of UTF, etc. I'm reading this http://stackoverflow.com/questions/9356169/utf-8-continuation-bytes but it doesn't make sense to me since it's all new to me. Haven't learned pattern matching either. – fun_programming Apr 12 '13 at 22:17
  • @learningphp I tried to expand the answer to include the bare minimum of knowledge you need about UTF-8 and pattern matching to understand the solution. I hope it helps. – Martin Ender Apr 12 '13 at 23:42
  • Wow.. this is REALLY helpful. thanks for describing it this way. I can't upvote you enough! – fun_programming Apr 13 '13 at 01:00

Here's another way to do it; it shows how to iterate through the characters of a UTF-8 string:

function butlast (str)
    local i,j,k = 1,0,-1
    while true do
        -- keep s and e local so they don't leak into the global environment
        local s, e = string.find(str, ".[\128-\191]*", i)
        if s then
            k = j
            j = e
            i = e + 1
        else break end
    end
    return string.sub(str,1,k)
end

Sample use:

> return butlast"כראע"
כרא
> return butlast"ㅎㄹㅇㅇㅅ"
ㅎㄹㅇㅇ
> return butlast"anñ"
an
> return butlast"hello there"
hello ther
> 
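
The same character-by-character walk can also be sketched with `string.gmatch` and a position capture `()`, which reports where each character starts. The helper name `eachChar` is made up for this illustration, and the input is assumed to be valid UTF-8.

```lua
-- Iterate over UTF-8 characters, yielding start position and character.
local function eachChar(s)
    return string.gmatch(s, "()([^\128-\191][\128-\191]*)")
end

local s = "a\195\177\227\133\133"   -- "añㅅ" as raw bytes
for pos, ch in eachChar(s) do
    print(pos, #ch)                 -- 1 1, then 2 2, then 4 3
end
```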
Doug Currie

Going by prapin's solution here:

function lastCharacter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*$")
end

You can then get the length of the returned value to see if it's multi-byte or not; you can also remove it from the string using the gsub function:

function deleteLastCharacter(str)
  -- make sure to add "()" around gsub to force it to return only one value
  return(str:gsub("[%z\1-\127\194-\244][\128-\191]*$", ""))
end

for _, str in ipairs{"hello there", "anñ", "כראע"} do
  print(str, " -->-- ", deleteLastCharacter(str))
end

Note that these patterns only work with valid UTF-8 strings. If you have a possibly invalid one, you may need to apply more complex logic.
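
A sketch of the length check mentioned earlier, answering the "is the last character multibyte?" part of the question. The name `isLastCharMultibyte` is made up for this illustration, it uses the simpler negated-class pattern rather than the one above, and it assumes valid UTF-8 (a stray continuation byte at the end would be misreported).

```lua
-- True when the last character of a valid UTF-8 string takes more than one byte.
local function isLastCharMultibyte(str)
    local last = string.match(str, "[^\128-\191][\128-\191]*$")
    return last ~= nil and #last > 1
end

print(isLastCharMultibyte("hello there"))   -- false
print(isLastCharMultibyte("an\195\177"))    -- true  ("anñ" as raw bytes)
print(isLastCharMultibyte(""))              -- false (no last character)
```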

Community
Paul Kulchenko