python regex to find multiline C comment spanning multiple lines

Question

I'm trying to write a regex that matches multi-line C comments. I managed to make it work for /* comments here */, but it does not work if the comment continues onto the next line. How do I write a regex that spans multiple lines?

Using this as my input:

/* this comment
must be recognized */

The problem is that "must", "be" and "recognized" are matched as IDs, and */ is reported as illegal characters.

#!/usr/bin/python
import ply.lex as lex
tokens = ['ID', 'COMMENT']

t_ID   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_COMMENT(t):
    r'(?s)/\*(.*?).?(\*/)'
    #r'(?s)/\*(.*?).?(\*/)' does not work either.
    return t

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lex.lex()   #Build the lexer

lex.input('/* this comment\r\n must be recognised */\r\n')
while True:
    tok = lex.token()
    if not tok:
        break
    if tok.type == 'COMMENT':
        print(tok.type)

I tried quite a few approaches, including "Create array of regex match(multiline)" and "How to handle multiple rules for one token with PLY", plus a few other things available at http://www.dabeaz.com/ply/ply.html

Oops! I just realized that you _are_ supplying the DOTALL flag via the alternate `(?s)` syntax. OTOH, your `t_COMMENT(t)` function looks odd. You aren't assigning that regex to anything, and the return statement isn't indented properly. — PM 2Ring, Sep 08 '15 at 09:32
Hi @Thapelo. Welcome to Stack Overflow. To help you with this, we probably need a [minimal, complete and verifiable example](http://stackoverflow.com/help/mcve). As it is, the code you've posted won't run, and it's not obvious at the moment what `t_COMMENT` is doing or how you call it. If you can edit in some more context, I'm sure someone will be along who can help. — J Richard Snape, Sep 08 '15 at 09:56
You might find this example useful - it's a full C parser implemented in python, using PLY https://github.com/eliben/pycparser — J Richard Snape, Sep 08 '15 at 09:58
Thanks guys for the feedback. From the ply [docs](http://www.dabeaz.com/ply/ply.html#ply_nn21), this regex `r'(/\*(.|\n)*?\*/)|(//.*)'` worked for me. There was something wrong with the way I was reading tokens from the file. — Thapelo, Sep 10 '15 at 08:40

score 1 · Answer 1 · answered May 23 '22 at 08:03

I use this regex when I want to find multi line comments in C:

If I want to include the '/* */' chars:

\/\*(\*(?!\/)|[^*])*\*\/
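
As a quick check of the pattern above, here is a minimal sketch using Python's standard `re` module. The sample string and the names `C_COMMENT`, `source` and `comments` are made up for illustration. Because the pattern uses a negated character class instead of `.`, it crosses line breaks without needing any flag.

```python
import re

# Pattern from the answer: the body is any run of non-'*' characters,
# or a '*' that is not immediately followed by '/'.
C_COMMENT = re.compile(r'\/\*(\*(?!\/)|[^*])*\*\/')

# Hypothetical input: one comment spanning two lines, one ending in '**/'.
source = "int x; /* this comment\r\n must be recognized */ int y; /* second **/"

# group(0) is the whole match, including the '/*' and '*/' delimiters.
comments = [m.group(0) for m in C_COMMENT.finditer(source)]
print(comments)
```

`finditer` with `group(0)` is used rather than `findall`, since `findall` would return only the contents of the capturing group.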

If I don't want to include it:

(?<=\*)[\n]*.*[\n]*.*[\n]*[\n]*?[\n]*(?=\*)
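
The lookaround pattern above relies on `.`, which does not cross newlines by default, so it appears to cover only a limited number of lines. A possible alternative, offered as a sketch and not as the answerer's method, is to reuse the delimiter-inclusive pattern and wrap the body in a capturing group. The names `C_COMMENT_BODY`, `source` and `body` are illustrative.

```python
import re

# Same structure as the delimiter-inclusive pattern, but the body is
# wrapped in a capturing group so it can be read without '/*' and '*/'.
C_COMMENT_BODY = re.compile(r'/\*((?:\*(?!/)|[^*])*)\*/')

# Hypothetical three-line comment.
source = "/* first line\nsecond line\nthird line */"

match = C_COMMENT_BODY.search(source)
body = match.group(1)  # text between the delimiters, newlines included
print(repr(body))
```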

Shoosha


score 0 · Answer 2 · answered Feb 06 '16 at 21:35

By default, in the regexes used by the PLY lexer, the dot . does not match a newline \n. So if you really want to match any character, use (.|\n) instead of .

(I had the same problem, and your comment on your own question helped me, so I am posting it as an answer for newcomers.)
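
To illustrate the difference outside PLY, here is a small sketch with the standard `re` module. It assumes PLY compiles its rules with `re.VERBOSE` by default, so that flag is passed here to mimic it. The variable names are invented for the example.

```python
import re

text = "/* this comment\n must be recognized */"

# '.' stops at the newline, so the closing '*/' on the next line is never reached.
dot_only = re.compile(r'/\*.*?\*/', re.VERBOSE)

# '(.|\n)' explicitly allows a newline wherever '.' is allowed.
dot_or_newline = re.compile(r'/\*(.|\n)*?\*/', re.VERBOSE)

no_match = dot_only.search(text)
full_match = dot_or_newline.search(text)

print(no_match)             # None
print(full_match.group(0))  # the whole two-line comment
```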

score -1 · Answer 3 · answered May 21 '19 at 08:24

def t_COMMENT(t):
    r'(?s)/\*.*?\*/'
    return t

As described here:

(?s) is a modifier that makes . also match newlines
.*? is the non-greedy version of .*. It matches the shortest possible sequence of characters (before the \*/ that comes next)
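
The effect of the two pieces can be seen in a short sketch with plain `re`. The input string and the names `lazy` and `greedy` are only for illustration.

```python
import re

# Hypothetical input with two comments, the second spanning two lines.
source = "/* first */ int x; /* second\n comment */"

# Non-greedy: each match stops at the nearest '*/'.
lazy = re.findall(r'(?s)/\*.*?\*/', source)

# Greedy: '.*' runs to the last '*/', swallowing the code in between.
greedy = re.findall(r'(?s)/\*.*\*/', source)

print(lazy)    # two separate comments
print(greedy)  # one match covering the whole string
```

Without `(?s)`, neither pattern would find the second comment, because `.` would stop at the line break.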

edited May 21 '19 at 09:32

answered May 21 '19 at 08:24

Ali Shamakhi

