10

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?

Lua strings are plain byte sequences with no built-in Unicode awareness, so string.sub("ÆØÅ", 2, 2) returns a single stray byte (displayed as "?") rather than "Ø".

Is there a relatively simple UTF-8 parsing algorithm I could apply to the string byte by byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?

Or is this way too complex, requiring a huge library, etc.?

Yu Hao
forthrin

2 Answers

18

You can easily extract the first letter from a UTF-8 encoded string with the following code:

function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end

This works because a UTF-8 encoded code point is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one or more continuation bytes from 128 to 191.
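To make those byte ranges concrete, here is a small check of the function on the string from the question. This is only a sketch: the test string is written with decimal byte escapes (an assumption made so the result does not depend on how the source file is saved), and the pattern is the Lua 5.1 form used in this answer.

```lua
-- Same pattern as in the answer (Lua 5.1 syntax; %z matches the NUL byte).
local function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end

-- "ÆØÅ" as raw UTF-8 bytes: Æ = 195 134, Ø = 195 152, Å = 195 133.
local s = "\195\134\195\152\195\133"

local first = firstLetter(s)
print(#first)              -- prints 2: "Æ" occupies two bytes
print(first:byte(1, -1))   -- prints the byte values 195 and 134
print(firstLetter("Abc"))  -- prints A: ASCII letters are a single byte
```

Note that the pattern is not anchored, so if the string starts with stray continuation bytes, the match silently skips them and returns the first well-formed-looking sequence after them.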

You can even iterate over UTF-8 code points in a similar manner:

for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end

Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
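If the numeric value is needed, the matched bytes can be decoded by hand. The following is a minimal sketch, not part of the original answer: firstCodePoint is a hypothetical helper name, it assumes the input is well-formed UTF-8, and it uses plain arithmetic so that it also runs on Lua 5.1, which has no bitwise operators.

```lua
-- Hypothetical helper: decode the first UTF-8 sequence of `str` into its
-- numeric code point. Assumes well-formed input; performs no validation.
local function firstCodePoint(str)
  local ch = str:match("^[%z\1-\127\194-\244][\128-\191]*")
  if not ch then return nil end
  local b1 = ch:byte(1)
  if b1 < 128 then return b1 end        -- single-byte (ASCII) sequence
  -- Strip the length-marker bits from the lead byte.
  local cp
  if b1 >= 240 then cp = b1 - 240       -- 4-byte sequence: 11110xxx
  elseif b1 >= 224 then cp = b1 - 224   -- 3-byte sequence: 1110xxxx
  else cp = b1 - 192 end                -- 2-byte sequence: 110xxxxx
  -- Each continuation byte (10xxxxxx) contributes six more bits.
  for i = 2, #ch do
    cp = cp * 64 + (ch:byte(i) - 128)
  end
  return cp
end

print(firstCodePoint("\195\134\195\152\195\133"))  -- prints 198 (U+00C6, "Æ")
```

Unlike firstLetter, this sketch anchors the pattern with ^, so it returns nil instead of skipping ahead when the string does not begin with a valid lead byte.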

prapin
  • Brilliant! This was exactly the answer I was looking for. Short and precise. – forthrin Nov 05 '12 at 19:17
  • This is reasonable for data that's already been validated but you might want to be careful with data which hasn't been. – bames53 Nov 05 '12 at 19:20
6

Lua 5.3 provides a built-in UTF-8 library, utf8.

You can use utf8.codes to get each code point, and then use utf8.char to get the character:

local str = "ÆØÅ"
for _, c in utf8.codes(str) do
  print(utf8.char(c))
end

This also works:

local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
  print(w)
end

Here utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches exactly one UTF-8 byte sequence, assuming the subject is a valid UTF-8 string.
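For the original goal of getting only the first letter, the same library can be used without a loop. This is a sketch under the assumption that Lua 5.3 or later is available; firstLetter53 is a hypothetical name chosen for illustration. It uses utf8.len to reject invalid input and utf8.offset to find where the second character starts.

```lua
-- Requires Lua 5.3 or later (the utf8 library does not exist in 5.1/5.2).
-- firstLetter53 is a hypothetical name used only for this sketch.
local function firstLetter53(str)
  -- utf8.len returns a false value plus a byte position on invalid input.
  local len, badPos = utf8.len(str)
  if not len then
    return nil, "invalid UTF-8 at byte " .. badPos
  end
  if len == 0 then return "" end
  -- utf8.offset(str, 2) is the byte position where the second character
  -- starts, or #str + 1 when the string holds exactly one character.
  return str:sub(1, utf8.offset(str, 2) - 1)
end

-- "ÆØÅ" written as raw UTF-8 bytes so the file encoding does not matter.
print(firstLetter53("\195\134\195\152\195\133") == "\195\134")  -- prints true
```

Validating the whole string up front costs a full pass over the input; for data already known to be valid, the utf8.len check could be dropped.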

Yu Hao