Extract words in Lua split by Unicode spaces and control characters

Question

I'm interested in a pure-Lua (i.e., no external Unicode library) solution to extracting the units of a string between certain Unicode control characters and spaces. The code points I would like to use as delimiters are:

0000-0020 007f-00a0 00ad 1680 2000-200a 2028-2029 202f 205f 3000

I know how to access the code points in a string, for example:

> for i,c in utf8.codes("é$ \tπ😃") do print(c) end
233
36
32
9
960
128515

but I am not sure how to "skip" the spaces and tabs and reassemble the remaining code points into strings. In the example above, I would like to drop the 32 and 9, then perhaps use utf8.char(233, 36) and utf8.char(960, 128515) to get ["é$", "π😃"].

It seems that putting everything into a table of numbers and painstakingly walking through the table with for-loops and if-statements would work, but is there a better way? I looked into string:gmatch but that seems to require making utf8 sequences out of each of the ranges I want, and it's not clear what that pattern would even look like.

Is there an idiomatic way to extract the strings between the spaces? Or must I manually hack tables of code points? gmatch does not look up to the task. Or is it?

Does this answer your question? [Split string in Lua?](https://stackoverflow.com/questions/1426954/split-string-in-lua) — JosefZ, Mar 02 '21 at 22:24
Thanks, I did read that entire question (with so many great answers) and a lot of other Lua unicode questions on this site, but it seems that the pattern I need would require painstakingly generating the utf8 encodings for all code points at each end of the range. Or is there an answer in there that does not? — Ray Toal, Mar 02 '21 at 22:44

score 2 · Accepted Answer · answered Mar 02 '21 at 23:09

> would require painstakingly generating the utf8 encodings for all code points at each end of the range.

Yes. But of course not manually.

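-- Build a Lua pattern matching every code point from `from` to `to`.
-- Both ends must share all UTF-8 bytes except the last, i.e. lie in the
-- same 64-code-point block, so only the final byte needs a byte range.
local function range(from, to)
   assert(utf8.codepoint(from) // 64 == utf8.codepoint(to) // 64)
   return from:sub(1,-2).."["..from:sub(-1).."-"..to:sub(-1).."]"
end

-- Replace every delimiter with a plain space, then print each run of non-spaces.
local function split_unicode(s)
   for w in s
      :gsub("[\0-\x1F\x7F]", " ")                -- U+0000-U+001F and U+007F
      :gsub(range("\u{0080}", "\u{00a0}"), " ")  -- U+0080-U+00A0
      :gsub("\u{00ad}", " ")
      :gsub("\u{1680}", " ")
      :gsub(range("\u{2000}", "\u{200a}"), " ")
      :gsub(range("\u{2028}", "\u{2029}"), " ")
      :gsub("\u{202f}", " ")
      :gsub("\u{205f}", " ")
      :gsub("\u{3000}", " ")
      :gmatch"%S+"
   do
      print(w)
   end
end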
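
To see what range actually builds, it may help to dump the generated pattern byte by byte. The snippet below is only an illustrative sketch (Lua 5.3 or later assumed; the variable names pattern, hex and replaced are mine). It repeats the range helper so that it runs on its own. Because the last byte of a UTF-8 sequence carries the low six bits of the code point, two code points in the same 64-wide block differ only in that final byte, which is why a single byte class is enough.

```lua
-- Copy of the answer's helper so this snippet is self-contained.
local function range(from, to)
   assert(utf8.codepoint(from) // 64 == utf8.codepoint(to) // 64)
   return from:sub(1,-2).."["..from:sub(-1).."-"..to:sub(-1).."]"
end

-- U+2000 is E2 80 80 and U+200A is E2 80 8A in UTF-8, so the pattern is the
-- shared prefix E2 80 followed by the byte class [\x80-\x8A].
local pattern = range("\u{2000}", "\u{200a}")

-- Hex-dump the pattern (illustration only).
local hex = pattern:gsub(".", function(c) return ("%02X "):format(c:byte()) end)
print(hex)        -- E2 80 5B 80 2D 8A 5D

-- U+2005 (E2 80 85) falls inside the class, so it is replaced.
local replaced = ("a\u{2005}b"):gsub(pattern, " ")
print(replaced)   -- a b
```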

Test:

split_unicode("@\0@\t@\x1F@\x7F@\u{00a0}@\u{00ad}@\u{1680}@\u{2000}@\u{2005}@\u{200a}@\u{2028}@\u{2029}@\u{202f}@\u{205f}@\u{3000}@")
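
For comparison, the code-point walk described in the question can also be written without building a table of numbers: utf8.codes yields the byte position of each character, so each word can be cut out of the original string with string.sub. This is a minimal sketch, not part of the answer above. The names is_delimiter and split_by_codepoints are hypothetical, Lua 5.3 or later is assumed, and utf8.codes raises an error on invalid UTF-8 input.

```lua
-- Hypothetical helper: true if cp is one of the delimiters listed in the question.
local function is_delimiter(cp)
   return cp <= 0x20
      or (cp >= 0x7F and cp <= 0xA0)
      or cp == 0xAD
      or cp == 0x1680
      or (cp >= 0x2000 and cp <= 0x200A)
      or cp == 0x2028 or cp == 0x2029
      or cp == 0x202F
      or cp == 0x205F
      or cp == 0x3000
end

-- Hypothetical function: returns a table of words instead of printing them.
local function split_by_codepoints(s)
   local words = {}
   local start = nil   -- byte index where the current word begins, if any
   for pos, cp in utf8.codes(s) do
      if is_delimiter(cp) then
         if start then
            words[#words + 1] = s:sub(start, pos - 1)
            start = nil
         end
      elseif not start then
         start = pos
      end
   end
   if start then words[#words + 1] = s:sub(start) end
   return words
end

-- The question's example: U+00E9 "$" space tab U+03C0 U+1F603.
local words = split_by_codepoints("\u{E9}$ \t\u{3C0}\u{1F603}")
for _, w in ipairs(words) do print(w) end
```

The gsub chain is shorter, while this version keeps the delimiter list as plain numeric ranges and avoids constructing intermediate strings; which one reads better is largely a matter of taste.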
