2

I've been working on a text editor that uses LPEG to implement syntax highlighting support. Getting things up and running was pretty simple, but I've only done the minimum required.

I've defined a bunch of patterns like this:

 -- Keywords
 local keyword = C(
    P"auto" +
    P"break" +
    P"case" +
    P"char" + 
    P"int" 
    -- more ..
  ) / function() add_syntax( RED, ... )

This correctly handles input, but unfortunately matches too much. For example int matches in the middle of printf, which is expected because I'm using "P" for a literal match.

Obviously to perform "proper" highlighting I need to match on word-boundaries, such that "int" matches "int", but not "printf", "vsprintf", etc.

I tried to use this to limit the match to only occurring after "<[{ \n", but this didn't do what I want:

  -- space, newline, comma, brackets followed by the keyword
  S(" \n(<{,")^1 * P"auto"  + 

Is there a simple, obvious, solution I'm missing here to match only keywords/tokens that are surrounded by whitespace or other characters that you'd expect in C-code? I do need the captured token so I can highlight it, but otherwise I'm not married to any particular approach.

e.g. These should match:

 int foo;
 void(int argc,std::list<int,int> ) { .. };

But this should not:

 fprintf(stderr, "blah.  patterns are hard\n");

2 Answers2

3

The LPeg construction -pattern (or more specifically -idchar in the following example) does a good job of making sure that the current match is not followed by pattern (i.e. idchar). Luckily this also works for empty strings at the end of the input, so we don't need special handling for that. For making sure that a match is not preceded by a pattern, LPeg provides lpeg.B(pattern). Unfortunately, this requires a pattern that matches a fixed length string, and so won't work at the beginning of the input. To fix that the following code separately tries to match without lpeg.B() at the beginning of the input before falling back to a pattern that checks suffixes and prefixes for the rest of the string:

local L = require( "lpeg" )

local function decorate( word )
  -- highlighting in UNIX terminals
  return "\27[32;1m"..word.."\27[0m"
end

-- matches characters that may be part of an identifier
local idchar = L.R( "az", "AZ", "09" ) + L.P"_"
-- list of keywords to be highlighted
local keywords = L.C( L.P"in" +
                      L.P"for" )

local function highlight( s )
  local p = L.P{
    (L.V"nosuffix" + "") * (L.V"exactmatch" + 1)^0,
    nosuffix = (keywords / decorate) * -idchar,
    exactmatch = L.B( 1 - idchar ) * L.V"nosuffix",
  }
  return L.match( L.Cs( p ), s )
end

-- tests:
print( highlight"" )
print( highlight"hello world" )
print( highlight"in 0in int for  xfor for_ |for| in" )
siffiejoe
  • 4,141
  • 1
  • 16
  • 16
  • I was being demanding, wanting an answer on a plate, but you definitely exceeded my expectations. Accepted. Thank you so much. –  Aug 05 '16 at 17:54
2

I think you should negate the matching pattern similar to how it's done in the example from the documentation:

If we want to look for a pattern only at word boundaries, we can use the following transformer:

local t = lpeg.locale()
function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end

This SO answer also discussed somewhat similar solution, so may be of interest.

There is also another editor component that uses LPeg for parsing with the purpose of syntax highlighting, so you may want to look at how they handle this (or use their lexers if it works for your design).

Community
  • 1
  • 1
Paul Kulchenko
  • 25,884
  • 3
  • 38
  • 56
  • OK looking at this more carefully this approach works for highlighting but only handles suffix-matching. i.e "ending" doesn't match "end", but "vend" does. –  Aug 01 '16 at 05:07