I'm interested in a pure-Lua (i.e., no external Unicode library) solution to extracting the units of a string between certain Unicode control characters and spaces. The code points I would like to use as delimiters are:
0000-0020 007f-00a0 00ad 1680 2000-200a 2028-2029 202f 205f 3000
I know how to access the code points in a string, for example:
> for i,c in utf8.codes("é$ \tπ") do print(c) end
233
36
32
9
960
128515
but I am not sure how to "skip" the spaces and tabs and reconstitute the other codepoints into strings themselves. What I would like to do in the example above, is drop the 32 and 9, then perhaps use utf8.char(233, 36) and utf8.char(960, 128515) to somehow get ["é$", "π"].
It seems that putting everything into a table of numbers and painstakingly walking through the table with for-loops and if-statements would work, but is there a better way? I looked into string:gmatch but that seems to require making utf8 sequences out of each of the ranges I want, and it's not clear what that pattern would even look like.
Is there a idiomatic way to extract the strings between the spaces? Or must I manually hack tables of code points? gmatch
does not look up to the task. Or is it?