3

I'm trying to write a regex that matches multi-line C comments. I managed to make it work for /* comments here */, but it does not work if the comment continues onto the next line. How do I make a regex that spans multiple lines?

Using this as my input:

/* this comment
must be recognized */

The problem is that "must", "be" and "recognized" are matched as IDs, and */ is reported as illegal characters.

#!/usr/bin/python
import ply.lex as lex
tokens = ['ID', 'COMMENT']

t_ID   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_COMMENT(t):
    r'(?s)/\*(.*?).?(\*/)'
    #r'(?s)/\*(.*?).?(\*/)' does not work either.
    return t

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lex.lex()   #Build the lexer

lex.input('/* this comment\r\n must be recognised */\r\n')
while True:
    tok = lex.token()
    if not tok:
        break
    if tok.type == 'COMMENT':
        print(tok.type)

I have tried quite a few things: "Create array of regex match(multiline)", "How to handle multiple rules for one token with PLY", and a few other things from the PLY documentation at http://www.dabeaz.com/ply/ply.html

Thapelo
  • 1
    Oops! I just realized that you _are_ supplying the DOTALL flag via the alternate `(?s)` syntax. OTOH, your `t_COMMENT(t)` function looks odd. You aren't assigning that regex to anything, and the return statement isn't indented properly. – PM 2Ring Sep 08 '15 at 09:32
  • Hi @Thapelo. Welcome to Stack Overflow. To help you with this, we probably need a [minimal, complete and verifiable example](http://stackoverflow.com/help/mcve). As it is, the code you've posted won't run, and it's not obvious at the moment what `t_COMMENT` is doing or how you call it. If you can edit in some more context, I'm sure someone will be along who can help. – J Richard Snape Sep 08 '15 at 09:56
  • You might find this example useful - it's a full C parser implemented in python, using PLY https://github.com/eliben/pycparser – J Richard Snape Sep 08 '15 at 09:58
  • 1
    Thanks guys for the feedback. From the PLY [docs](http://www.dabeaz.com/ply/ply.html#ply_nn21), this regex `r'(/\*(.|\n)*?\*/)|(//.*)'` worked for me. There was something wrong with the way I was reading tokens from the file. – Thapelo Sep 10 '15 at 08:40

3 Answers

1

I use this regex when I want to find multi-line comments in C:

If I want to include the '/* */' delimiters:

\/\*(\*(?!\/)|[^*])*\*\/

If I don't want to include them:

(?<=\*)[\n]*.*[\n]*.*[\n]*[\n]*?[\n]*(?=\*)
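
To try the first pattern outside a lexer, here is a minimal sketch that assumes Python's standard `re` module. The `COMMENT_BODY` variant and the sample `source` string are illustrative assumptions, not part of the answer: it swaps the lookaround pattern for a plain capturing group as another way to get the text without the delimiters.

```python
import re

# Pattern from the answer, with the optional escapes on "/" dropped.
# It matches "/*", then any run of non-"*" characters or "*" not followed
# by "/", then the closing "*/". No DOTALL flag is needed because the
# character class [^*] already matches "\n".
COMMENT = re.compile(r'/\*(\*(?!/)|[^*])*\*/')

# Illustrative variant (an assumption, not the answer's pattern): a
# capturing group around the body returns the text between the delimiters.
COMMENT_BODY = re.compile(r'/\*((?:\*(?!/)|[^*])*)\*/')

source = "int x; /* this comment\nmust be recognized */ int y;"

whole = COMMENT.search(source).group(0)
body = COMMENT_BODY.search(source).group(1)

print(repr(whole))
print(repr(body))
```

Because the pattern never relies on `.`, it should behave the same whether or not a "dot matches newline" flag is set.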
Shoosha
0

By default, in the regexes used by the PLY lexer, the dot . does not match a newline \n. So if you really want to match any character, use (.|\n) instead of .

(I had the same problem, and your comment on your own question helped me, so I created this answer for newcomers.)
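
The difference can be checked with Python's standard `re` module alone, without PLY. This is a small sketch; the sample `text` and the variable names are chosen for illustration.

```python
import re

text = "/* this comment\nmust be recognized */"

# By default "." stops at a newline, so this pattern cannot span two lines.
plain = re.search(r'/\*.*?\*/', text)

# "(.|\n)" matches a newline explicitly, so no flag is needed.
alternation = re.search(r'/\*(.|\n)*?\*/', text)

# re.DOTALL (the flag behind "(?s)") makes "." match a newline as well.
dotall = re.search(r'/\*.*?\*/', text, re.DOTALL)

print(plain)                  # no match across the line break
print(alternation.group(0))   # the whole two-line comment
print(dotall.group(0))        # the whole two-line comment
```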

Q-B
-1
def t_COMMENT(t):
    r'(?s)/\*.*?\*/'
    return t

As described here:

  • (?s) is a modifier that makes . also match newlines
  • .*? is the non-greedy version of .*. It matches the shortest possible sequence of characters (up to the next \*/)
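
The effect of the non-greedy quantifier shows up as soon as the input holds more than one comment. The sketch below assumes Python's standard `re` module and passes `re.DOTALL` as a flag argument instead of writing `(?s)` inline; the sample `source` string is illustrative.

```python
import re

source = "/* first */ int x; /* second\nline */"

# Greedy: ".*" runs to the LAST "*/", swallowing the code between comments.
greedy = re.findall(r'/\*.*\*/', source, re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST "*/" after each "/*".
lazy = re.findall(r'/\*.*?\*/', source, re.DOTALL)

print(greedy)
print(lazy)
```

Depending on the Python and PLY versions in use, an inline `(?s)` inside a rule may be rejected or produce a warning, because PLY combines the rules into one larger pattern and the flag then no longer sits at the start of the expression. Using `(.|\n)` in the rule avoids that issue.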
Ali Shamakhi