Matching Unicode punctuation using LPeg

Question

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- matches any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

This appears to work, but I see three problems. First, it will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I am reading only 4 bytes ahead. Second, it probably kills the performance of the parser. Third, it is undefined what the library's match function will do when I feed it a string that ends in a truncated UTF-8 character (although it appears to work now).

I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.

It would help to have a concrete example of what it fails to match along with the expected result. — Paul Kulchenko, Aug 17 '16 at 22:23
It does not fail to match any input I am throwing it, I am just not confident this is the correct approach and I feel like I am introducing subtle bugs that will bite me later. — Witiko, Aug 17 '16 at 22:27

score 3 · Accepted Answer · answered Aug 18 '16 at 07:18

The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are part of it:

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
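
As a quick check of what this pattern consumes, here is a minimal sketch that splits a string into UTF-8 characters. It only assumes that LPeg is installed; the names chars and t are made up for the illustration.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont

-- Capture every UTF-8 character of the subject into a table.
local chars = lpeg.Ct(lpeg.C(utf8)^0)

-- "a", U+00A1 INVERTED EXCLAMATION MARK (two bytes), "!"
local t = chars:match("a\194\161!")
print(#t)     -- 3 characters
print(#t[2])  -- the second one is 2 bytes long

-- A lead byte without its continuation byte does not match at all.
print(utf8:match("\194"))  -- nil
```

Because the pattern only succeeds on a complete character, the function called from Cmt should never receive a truncated sequence, which seems to address the concern in the question.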

Building on this utf8 pattern we can use lpeg.Cmt and the Selene Unicode match function kind of like you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

Note that we return i; this is in accordance with what Cmt expects:

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

This means we should return the same number the function receives, that is the position immediately after the UTF-8 character.
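
To see the returned position in action without Selene Unicode, here is a rough sketch where a small lookup table stands in for unicode.utf8.match(c, "%p"). The table is_punct is hypothetical and deliberately incomplete; it only exists to keep the example self-contained.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont

-- Hypothetical stand-in for unicode.utf8.match(c, "%p"):
-- a tiny, incomplete set of punctuation characters.
local is_punct = {
    ["!"] = true, [","] = true, ["."] = true,
    ["\194\161"] = true,      -- U+00A1 INVERTED EXCLAMATION MARK
    ["\226\128\166"] = true,  -- U+2026 HORIZONTAL ELLIPSIS
}

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if is_punct[c] then
        return i -- i already points just past the matched character
    end
    -- returning nothing makes the match fail
end)

print(punctuation:match("\226\128\166x")) -- 4: position after the 3-byte character
print(punctuation:match("x"))             -- nil: "x" is not in the table
```

With the real library, swapping the table lookup back to unicode.utf8.match(c, "%p") should give the pattern from the answer.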

This seems to be the correct approach; thank you. I feel stupid for not noticing that there is an example directly at the LPeg homepage. — Witiko, Aug 19 '16 at 10:55
As of now, it seems that `lpeg.utfR` was introduced in LPeg 1.1 for matching Unicode ranges, and the Unicode decoding examples were removed from the docs. However, version 1.1 is not available in LuaRocks yet. The Wayback Machine has the latest page with the Unicode examples, [from June 4, 2023](https://web.archive.org/web/20230604202824/https://www.inf.puc-rio.br/~roberto/lpeg/). — aaa, Sep 01 '23 at 18:09
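
For readers who do have LPeg 1.1, a possible sketch with lpeg.utfR follows. The two code point ranges are illustrative assumptions and cover only a small part of the Unicode punctuation categories, and the guard keeps the snippet harmless on older versions where lpeg.utfR is missing.

```lua
local lpeg = require("lpeg")

-- lpeg.utfR(cp1, cp2) only exists in LPeg 1.1 and later, so guard its use.
-- The ranges below are examples, NOT the full set of Unicode punctuation.
local punctuation = lpeg.utfR and (
      lpeg.utfR(0x21, 0x23)      -- ! " #
    + lpeg.utfR(0x2010, 0x2027)  -- dashes, quotation marks, bullets, ellipsis
)

if punctuation then
    -- U+2026 HORIZONTAL ELLIPSIS is three bytes in UTF-8.
    print(punctuation:match("\226\128\166"))
    print(punctuation:match("a")) -- nil
else
    print("lpeg.utfR is not available in this LPeg version")
end
```

A complete solution would still need the full list of punctuation ranges from the Unicode character database, or a library such as Selene Unicode as in the answer.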
