How to write a Lua pattern for words with umlauts

Question

Words like "Annähren", "Überbringen", "Malmö" are not caught by

for w in string.gmatch(str, "%w+") do
  print(w)
end

Any solution? thanks!

Can you try "%S+"? I remember reading somewhere that %S represents every char that is NOT a space. So: `for w in string.gmatch(str, "%S+")` — Arun R, Sep 11 '13 at 21:39
That's close to my (hopefully) final solution: `for w in string.gmatch(myStr, "[^,;]+") do print(w) end` That works for my needs. — sunmils, Sep 14 '13 at 16:53
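
A minimal sketch of the delimiter-based approach from the comment above, assuming UTF-8 input. The umlauts are written as decimal byte escapes so the result does not depend on how the snippet is saved. The helper name `split_fields` and the whitespace trimming are additions for illustration, not part of the original comment. Splitting on ASCII delimiters is safe in UTF-8 because bytes below 128 never occur inside a multi-byte sequence.

```lua
-- Hypothetical helper: split on "," and ";" and trim surrounding whitespace.
local function split_fields(str)
  local fields = {}
  for field in string.gmatch(str, "[^,;]+") do
    -- Strip leading and trailing whitespace left over from ", " separators.
    local trimmed = string.match(field, "^%s*(.-)%s*$")
    if trimmed ~= "" then
      fields[#fields + 1] = trimmed
    end
  end
  return fields
end

-- "Annähren, Überbringen; Malmö" with the umlauts as UTF-8 byte escapes.
local fields = split_fields("Ann\195\164hren, \195\156berbringen; Malm\195\182")
for _, f in ipairs(fields) do
  print(f)
end
```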

score 3 · Answer 1 · answered Sep 11 '13 at 14:36

The Lua string library does not intrinsically support any character encoding other than ASCII, and it assumes every character is 1 byte. Lua strings are 8-bit clean, so they can hold text in any encoding, but functions like string.sub expect offsets in bytes even in multi-byte character encodings, and functions like string.match will not behave as expected with non-ASCII encodings. It is worth reading the wiki page on Unicode in Lua, much of which also applies to other non-ASCII character encodings.
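
A small sketch of what "offsets in bytes" means in practice, assuming the text is UTF-8 and Lua runs in the default "C" locale. The string is built from decimal byte escapes so the example does not depend on the encoding of the source file.

```lua
-- "Malmö": \195\182 is the UTF-8 encoding of "ö".
local s = "Malm\195\182"

-- The length operator counts bytes, not characters: 5 characters, 6 bytes.
local byte_len = #s

-- string.sub takes byte offsets, so stopping at 5 cuts "ö" in half and
-- leaves a dangling lead byte (\195) at the end of the result.
local cut = string.sub(s, 1, 5)

-- %w+ stops at the first byte that is not classified as alphanumeric,
-- which in the "C" locale is the first byte of "ö".
local ascii_part = string.match(s, "%w+")

print(byte_len, #cut, ascii_part)
```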

For your issue in particular, 'ö' is (in, for example, UTF-8) encoded as the two bytes C3 B6, which means that it will not be recognized by '%w' (which matches single bytes that the C library classifies as alphanumeric, i.e. a-z, A-Z and 0-9 in the default "C" locale, and has no concept of characters spanning multiple bytes). '[\xc3\xb6]+' (the \x escapes need Lua 5.2 or later) will match it, but will also match a lot of other things, not all of which are even valid UTF-8. Using '[ö]' has the same issue, as Lua will interpret it as the same thing (a set of two separate bytes rather than a single character). If you are not using UTF-8, the specifics are different, but the basic problem remains the same.
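
To illustrate the over-matching, here is a hedged sketch assuming UTF-8. It uses decimal escapes (which also work in Lua 5.1) instead of \x escapes. The variable names are made up for the example.

```lua
local o_umlaut = "\195\182"   -- "ö" in UTF-8
local u_umlaut = "\195\188"   -- "ü" in UTF-8

-- A class containing the two bytes of "ö" is just a set of two single bytes.
local class = "[" .. o_umlaut .. "]+"

-- It matches "ö" as intended...
local hit = string.match("Malm" .. o_umlaut, class)

-- ...but it also matches the lead byte of "ü", which shares \195 with "ö".
-- The result is a single byte that is not valid UTF-8 on its own.
local partial = string.match(u_umlaut, class)

print(#hit, #partial)
```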

The wiki page links a number of UTF-8 aware string library implementations for Lua, such as slnunicode. Other encodings do not appear to be widely used by the community, so if you are using an encoding other than UTF-8, your best bet may be to convert to UTF-8 and then use that library or another like it.
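
As a rough illustration of what a UTF-8 aware function does differently, the sketch below counts characters instead of bytes. It does not use slnunicode. It assumes the built-in `utf8` library that ships with Lua 5.3 and later, and falls back to counting non-continuation bytes on older versions. The function name `char_count` is hypothetical, and the fallback does not validate its input.

```lua
-- Hypothetical helper: number of UTF-8 characters in str.
local function char_count(str)
  if utf8 and utf8.len then
    -- Lua 5.3+: utf8.len returns the character count for valid UTF-8.
    return utf8.len(str)
  end
  -- Older Lua: every character has exactly one byte outside the
  -- continuation range 128-191, so count those bytes.
  local _, n = string.gsub(str, "[^\128-\191]", "")
  return n
end

local s = "Malm\195\182"      -- "Malmö" as UTF-8 bytes
print(#s, char_count(s))      -- byte count vs. character count
```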

In practice, you are basically correct. The Lua spec (and source code) don't require any particular character set and encoding. The behavior of a few functions in Lua's `string` library depends on the C runtime library (or equivalent) that it is built for. Lua builders should provide their users with technical data on character sets, number characteristics, etc. **The `string` data type is a counted sequence of bytes, not a sequence of characters.** — Tom Blodget, Sep 14 '13 at 03:21

score 1 · Answer 2 · edited May 23 '17 at 12:25

You may try the following:

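local str = "Annähren, Überbringen, Malmö"
-- Treat ASCII alphanumerics plus every byte from 128 to 244 as word
-- characters; that range covers the bytes used in UTF-8 multi-byte sequences.
for w in string.gmatch(str, "[%w\128-\244]+") do
  print(w)
end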

It's not strictly correct as it ignores some UTF-8 combinations, but it may work for you. This SO answer and this post on validating UTF-8 may be useful.
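
A hedged sketch of both the intended behavior and one of the limitations, assuming UTF-8 input and the default "C" locale. The input is written with decimal byte escapes so it does not depend on the file encoding. The helper name `words` is made up for the example.

```lua
-- Hypothetical helper: collect all matches of the pattern above into a table.
local function words(str)
  local out = {}
  for w in string.gmatch(str, "[%w\128-\244]+") do
    out[#out + 1] = w
  end
  return out
end

-- "Annähren, Überbringen, Malmö" as UTF-8 bytes: three words are found.
local found = words("Ann\195\164hren, \195\156berbringen, Malm\195\182")

-- Limitation: every non-ASCII byte counts as a word character, so an
-- en dash (\226\128\147) between two words glues them into one match.
local joined = words("Malm\195\182\226\128\147Lund")

print(#found, #joined)
```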

answered Sep 11 '13 at 01:04

Paul Kulchenko

I'm not sure the OP is working with UTF-8. It might be some extended ASCII encoding, thus extending the range up to `\255` could be needed (or adding the specific char codes, if the OP can find out which they are). – LorenzoDonati4Ukraine-OnStrike Sep 11 '13 at 05:17
Could be; I'm not sure either. That's why I said "may try" ;) – Paul Kulchenko Sep 11 '13 at 05:43
Yep! That's why I felt like adding that hint. :-) – LorenzoDonati4Ukraine-OnStrike Sep 11 '13 at 05:47
If one has a text file or string (including Lua source) and doesn't know the character set and encoding, then one has experienced *data loss.* If one knows the character set and encoding of the data but doesn't know if the functions to be used are compatible with it then one is [programming by coincidence](http://pragprog.com/the-pragmatic-programmer/extracts/coincidence). – Tom Blodget Sep 14 '13 at 03:28
