The length of Arabic letters in Lua

Question

In Lua, when I try to get the length of a single Arabic letter (such as "ف"), the result is 2!

Example:

local letter = "ف"
print( letter:len() )

Output: 2

The same problem occurs with string.sub. If I want to print the first letter of an Arabic word, string.sub(word, 1, 1) does not work.

Example:

local word_1 = "فولت"
print( word_1:sub(1,2) )

Output: ف
As you can see, I had to pass 2, not 1, as the second argument to get the correct result.
If I pass 1 as the second argument, the output is:

print( word_1:sub(1,1) )

Output: Ù

Why does Lua count the length of a single Arabic letter as 2?

And is there a way to get the right length, which is 1?

It's probably Unicode representation, which means two bytes. Arabic, Hebrew, Kanji, Mandarin, etc. don't fit into the ASCII single byte per character way of thinking, because it can only have 2^8 = 256 characters. Arabic has more than that. — duffymo, Jan 15 '14 at 13:24

score 15 · Accepted Answer · edited Jan 29 '14 at 23:11

Lua is 8-bit clean.

In other words, a Lua string is a sequence of bytes; it has no built-in notion of Unicode. The Arabic letter "ف" is encoded as 2 bytes in UTF-8, so Lua treats it as a string of length 2.

You need a workaround to manipulate Unicode text. For example, assuming the string is well-formed UTF-8, you can use this snippet to count the number of characters (code points) in it (reference: Lua Unicode):

local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
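
As a rough sketch of how that pattern could be wrapped up for reuse, and extended to answer the string.sub part of the question, something like the following should work on Lua 5.1 and later. The helper names utf8_length and utf8_sub are made up for this example, and it assumes the source file is saved as UTF-8 and the input is well-formed UTF-8:

```lua
-- Hypothetical helpers; they assume well-formed UTF-8 input.

-- Count code points: every byte outside 128-193 starts a new character
-- (128-191 are continuation bytes; 192 and 193 never occur in valid UTF-8).
local function utf8_length(s)
  local _, count = string.gsub(s, "[^\128-\193]", "")
  return count
end

-- Return code points i through j (1-based, inclusive) of s.
local function utf8_sub(s, i, j)
  local chars = {}
  -- One non-continuation byte followed by any continuation bytes is one character.
  for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  j = math.min(j or #chars, #chars)
  return table.concat(chars, "", i, j)
end

local word_1 = "فولت"          -- assumes this file is saved as UTF-8
print(#word_1)                 -- 8 (bytes)
print(utf8_length(word_1))     -- 4 (code points)
print(utf8_sub(word_1, 1, 1))  -- ف
```

Note that utf8_sub splits the whole string on every call, which is fine for short words but wasteful for long text; the libraries linked from the Lua Unicode wiki page are a better fit there.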

edited Jan 29 '14 at 23:11

StormeHawke

answered Jan 15 '14 at 13:27

Yu Hao

You should probably point out that there are links to Lua modules and libraries that do most of the string operations for you at the bottom of the linked page ([LuaUnicode](http://lua-users.org/wiki/LuaUnicode)). – dualed Jan 16 '14 at 10:15
Perhaps mention the units you count, and the assumptions: Code-points (Not graphical characters) and well-formed input. – Deduplicator Jul 28 '14 at 12:23

score 1 · Answer 2 · answered Jan 14 '15 at 11:33

Lua 5.3 has now been released, and it provides a basic UTF-8 library.

utf8.len can be used to get the length of a UTF-8 string:

print(utf8.len("ف"))
-- 1
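
For the string.sub part of the question, the same library can help. The following is a minimal sketch, assuming Lua 5.3 or later and a source file saved as UTF-8; it uses utf8.offset to find byte positions and utf8.codes to walk the string one code point at a time:

```lua
-- Requires Lua 5.3 or later (the utf8 library is not in 5.1/5.2).
local word_1 = "فولت"  -- assumes this file is saved as UTF-8

print(#word_1)          -- 8 (bytes)
print(utf8.len(word_1)) -- 4 (code points)

-- utf8.offset(s, n) gives the byte position where the n-th character starts,
-- so the first letter spans from offset 1 up to just before offset 2.
local first = word_1:sub(utf8.offset(word_1, 1), utf8.offset(word_1, 2) - 1)
print(first)            -- ف

-- Iterate over the string one code point at a time.
for pos, cp in utf8.codes(word_1) do
  print(pos, utf8.char(cp))
end
```

If the string is not valid UTF-8, utf8.len returns a false value plus the position of the first invalid byte, so it is worth checking the result before using it as a number.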

answered Jan 14 '15 at 11:33

Yu Hao

score 0 · Answer 3 · answered Jan 16 '14 at 22:40

Lua being 8-bit clean is enough to say that Lua supports Unicode, though without an additional Unicode support library the extent of that support is minimal. For any Unicode string, there are at least four ways to measure it: code units, code points, grapheme clusters, and byte count. The byte count is a constant multiple of the code-unit count, with the factor depending on which UTF is used: 1 for UTF-8, 2 for UTF-16, 4 for UTF-32. So think clearly about which of those measures you need, and where.
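
To make the difference between these measures concrete, here is a small sketch (not part of the original answer) using an Arabic letter with a vowel mark. It assumes UTF-8 and writes the string as byte escapes so that it does not depend on the file encoding; it should run on Lua 5.1 and later:

```lua
-- "فُ": the letter fa (U+0641) followed by the combining mark damma (U+064F),
-- written as UTF-8 byte escapes so the example does not depend on file encoding.
local s = "\217\129\217\143"

-- Bytes; in UTF-8 a code unit is one byte, so this is also the code-unit count.
local bytes = #s

-- Code points: count the bytes that are not UTF-8 continuation bytes.
local _, code_points = string.gsub(s, "[^\128-\191]", "")

print(bytes)        -- 4
print(code_points)  -- 2
-- A reader sees one grapheme cluster; counting those needs a Unicode-aware library.
```

The same text therefore has three different "lengths" (4, 2, and 1) depending on the unit chosen.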

answered Jan 16 '14 at 22:40

Deduplicator

I agree with the approach but bytecount is not a _constant_ multiple of code units. The size of a code unit is constant, given an encoding. But the number of code units, depends on the code point being encoded (except for UTF-32, which is always 1). – Tom Blodget Jan 16 '14 at 22:49
Tom, please reread your comment. Your first two sentences are in violent disagreement with each other. And I cannot see what your last sentence should clarify or correct... – Deduplicator Mar 03 '14 at 16:37
The number of bytes in a code unit depends on the encoding: 1 for UTF-8, 2 for UTF-16, 4 for UTF-32, for example. The number of code units in a code point depends on the encoding and the code point: U+0000 ␀ has 1 in UTF-8, 1 in UTF-16, 1 in UTF-32, and 2 in modified UTF-8; U+1D58B has 4 in UTF-8, 2 in UTF-16, 1 in UTF-32, and 6 in modified UTF-8. [Modified UTF-8 is a non-Unicode-compliant variant of UTF-8 used by JNI.] – Tom Blodget Mar 04 '14 at 00:07
So, you concur that bytecount IS a constant multiple of codeunits. You contradicted that in your first sentence of your first comment, though the second sentence of your first comment contradicted your contradiction. That btw has nothing to do with codeunits per codepoint, which you added as a non-sequitur in your first comment, expanding upon it now. – Deduplicator Mar 04 '14 at 10:47
Going back to your answer, you might have meant "integral multiple" instead of "constant multiple". Regardless, I'm suggesting that you remove "multiple" altogether. The number of bytes in a string (which is what you were describing) cannot be calculated by multiplication, only by iterative addition (except for UTF-32). So, I think it is a misleading term in your otherwise good answer. – Tom Blodget Mar 04 '14 at 11:26
Tom, do not conflate CodeUnits and CodePoints. I clearly differentiated them for a reason. If you cannot keep them separate, of course you get confused. Naturally the constant is integral. It's even a power of 2. The important characteristic for that sentence was still the constantness. – Deduplicator Mar 04 '14 at 14:55
