I'm working on a semester mini-project for my Compiler Construction course.

Right now I'm designing the scanner, which is written in Java and scans Java source code. The scanner will produce tokens that are later consumed by the parser.

Most of the work so far uses Java regular expressions. The problem I'm facing is that when I pre-process the code to remove inline and multi-line comments, the regex also removes text inside string literals that merely looks like a comment. I'm using the following regex:

String regExPreProcess = "((?s)(/\\*.*?\\*/|/\\*.*))|(//.*)";

Could someone please shed some light on how to solve this? I've tried lookahead and lookbehind as well, but the issue persists.

Tim Pietzcker
Umar Tahir
  • I'm not even sure that's something a regex _can_ do... – Louis Wasserman Nov 11 '15 at 18:18
  • @Louis is right, regexes are no use for this. You can't just pluck out the bits that don't interest you, because you can't reliably identify them without knowing the whole context. – Alan Moore Nov 11 '15 at 19:04
  • Are you sure that's what you want? What does it mean for a string literal to have a comment inside it? Why would you ever want that? – mvd Nov 11 '15 at 19:26
  • @mvd: That's the point: they're **not** comments. I believe he wants to remove all comments before he starts the "real" lexing, but he knows string literals may contain things that *look* like comments, and he wants to know how to ignore them. (Please correct me if I'm wrong, Umar.) – Alan Moore Nov 11 '15 at 20:03
  • @Alan, yes that's what i wanna do... e.g. if there is code like "This is string //not a comment" OR "This is string /* not a comment */" Then the above regex must not remove comments inside the strings that start with comment symbols. – Umar Tahir Nov 11 '15 at 20:37
  • Maybe you should not pre-parse out comments? How about creating tokens for comments, and then just ignoring them/throwing them out when you are building the AST. – mvd Nov 11 '15 at 22:07

1 Answer

You first need to make a formal definition of inline and block (multi-line) comments.

Something like:

  • an inline comment starts with an inline comment delimiter (//) placed outside string literals and block comments, and ends at the end of the line
  • a string literal starts with a double quote (") placed outside inline and block comments, and ends with an unescaped double quote (")
  • an escaped double quote is a double quote preceded by an odd number of backslashes (\)
  • a block comment starts with a comment opening delimiter (/*) placed outside string literals and inline comments, and ends with a comment closing delimiter (*/)

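As you can see, these definitions depend on one another: whether (//) or (/*) starts a comment depends on whether you are inside a string literal, and whether (") starts a string literal depends on whether you are inside a comment. A regular expression that searches for comments in isolation is therefore not suitable for this problem. You need to process the input text sequentially: detect the start token and ignore everything until the corresponding end token.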
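The sequential approach can be sketched as a small state machine. This is only a minimal illustration, not a complete Java lexer: the class name CommentStripper, the method strip, and the State enum are hypothetical names chosen for the example. It assumes that a block comment should be replaced by a single space (so that the tokens on either side stay separated), that an unterminated block comment runs to the end of the input (as the regex in the question does), and that a newline ends an unterminated string or character literal. Text blocks (""") and Unicode escapes are not handled.

```java
final class CommentStripper {

    // Hypothetical states: which kind of construct the scanner is currently inside.
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    /** Returns src with // and block comments removed; string and char literals are left untouched. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i += 2;
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i += 2;
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                        i++;
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        // Copy the escaped character verbatim so that \" or \' cannot end the literal.
                        out.append(next);
                        i += 2;
                    } else {
                        boolean closed = (state == State.STRING && c == '"')
                                || (state == State.CHAR && c == '\'');
                        // Assumption: a raw newline also ends an unterminated literal.
                        if (closed || c == '\n') state = State.CODE;
                        i++;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') {
                        out.append(c); // keep the line break itself
                        state = State.CODE;
                    }
                    i++;
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        out.append(' '); // assumption: a comment separates tokens like whitespace
                        state = State.CODE;
                        i += 2;
                    } else {
                        i++;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String code = "String s = \"This is string //not a comment\"; // a real comment";
        System.out.println(strip(code));
    }
}
```

Because the scanner always knows which construct it is inside, the comment delimiters are only recognised in the CODE state, which is exactly what the definitions above require. The same loop could emit comment tokens instead of dropping them, as suggested in the comments on the question.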

user5500105