Regex - don't select comments inside double quotes

Question

I am working on Java syntax highlighting in an Android EditText.

I use regexes to highlight keywords, literals, strings, and numbers.

The regex I use to highlight strings: "\"(.*?)\"|'(.*?)'"

Comment regex: "/\\*(?:.|[\\n\\r])*?\\*/|//.*+\\/\\/.*"

The regexes are applied in sequence: the keyword regex first, then the others, then the string regex, and finally the comment regex.

These regexes select ordinary strings and comments correctly, but there is a problem:

Comments inside double quotes also get highlighted. I want comment matching to ignore anything inside double quotes.

Please see the image for a better understanding of the problem (expected output). Any help or guidance will be appreciated.

Giving us some examples of input and expected output/highlight would be nice :) — Nikolas Charalambidis, Aug 07 '16 at 20:40
@NikolasCharalambidis thanks for quick response. I have updated question with image. please have a look :) — Yogesh Lakhotia, Aug 07 '16 at 20:50
I don't know how Editext works, but have you tried swapping your rules? These kinds of highlighting engines surely offer ways to define a precedence order. — Lucas Trzesniewski, Aug 07 '16 at 21:07
Agreed, @LucasTrzesniewski. If it was just a normal regex, putting the string before the comment would mean that the entirety of the string is matched, meaning comments can only be matched _outside_ of strings. The comment inside a string never gets to be found because it was already consumed by the string-matching regex. — Whothehellisthat, Aug 07 '16 at 21:09
Do you really think that regular expressions are the right tool to do the work of a parser? What makes you think that you can come up with regular expressions that cover all possible Java source code? — GhostCat, Aug 07 '16 at 21:11
@LucasTrzesniewski if I swap them, then strings inside comments get highlighted. :( — Yogesh Lakhotia, Aug 07 '16 at 21:14
@Whothehellisthat on every regex check, it re-checks the whole text and performs highlighting. — Yogesh Lakhotia, Aug 07 '16 at 21:14
This is perfectly doable in a single regex, but sounds like there are other constraints that mean it can't be done that way. Could you show us the code you're using? — Whothehellisthat, Aug 07 '16 at 21:18
@Whothehellisthat the code is complex and distributed across many files, which is why adding it won't be possible. The logic I am using is: for(iterating list of regex) { // perform highlighting highlight(regex[i],editext.getText()) } — Yogesh Lakhotia, Aug 07 '16 at 21:26
If you could rewrite so that it uses a single regex you could do different highlights based on which group was matched, etc. If not, it will be impossible. They're all separately applied, so they don't know to not step on each others' toes. — Whothehellisthat, Aug 07 '16 at 21:35
@GhostCat regex *is* the right tool for a *lexer*, which is what syntax highlighting uses. — Lucas Trzesniewski, Aug 07 '16 at 22:07
@LucasTrzesniewski ... uups, makes sense. Learned something new today ( http://stackoverflow.com/questions/2842809/lexers-vs-parsers ) — GhostCat, Aug 07 '16 at 22:11

Stefan Dollase · Accepted Answer · 2016-08-07T21:46:34.093

To me, it seems like you are simply searching for all matches of each regex. If a regex matches, you color the match. Thus, you overwrite the color of a previous match with the color of the last match.

To solve this issue, you have to use a proper lexer that is able to translate a given input text into a stream of tokens. Then, you can run over the token stream and when you encounter a token that needs to be colored, you can do this.

This avoids the current issue, where one part of the input text is matched by multiple regexes and thus colored multiple times. It works because each character of the input text belongs to exactly one token in the token stream.

A lexer that uses the first-longest-match algorithm works like this: It searches for all regex matches that start at the beginning of the input text. It chooses the regex that has the longest match. If multiple regexes share the longest match, it chooses the first one. Now the lexer creates the first token of the token stream. The token consists of the token type (which is given by the regex), the start position of the match, and the end position of the match. Next, the lexer searches for the next token by doing the above again. This time, however, it searches for matches that start at the end position of the previous token. The lexer does this until the complete input text is transformed into a token stream, or until it encounters invalid input.
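The algorithm described above can be sketched in plain Java as follows. This is only a minimal illustration, not a complete Java lexer: the class name `LongestMatchLexer`, the token type names, the tiny keyword list, and the exact patterns are all assumptions made for the example. A catch-all `OTHER` rule is added so that unexpected characters become one-character tokens instead of stopping the lexer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a first-longest-match lexer; all names are illustrative.
class LongestMatchLexer {

    /** A token: its type plus the half-open range [start, end) it covers. */
    static final class Token {
        final String type;
        final int start;
        final int end;

        Token(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // LinkedHashMap keeps insertion order, so on a tie the earlier rule wins.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    LongestMatchLexer() {
        // Simplified, assumed patterns; a real Java lexer needs many more rules.
        rules.put("COMMENT", Pattern.compile("/\\*[\\s\\S]*?\\*/|//[^\\n]*"));
        rules.put("STRING", Pattern.compile(
                "\"(?:\\\\.|[^\"\\\\\\n])*\"|'(?:\\\\.|[^'\\\\\\n])*'"));
        // No word boundaries needed: "integer" is won by IDENT (longer match),
        // while "int" is a tie that KEYWORD wins because it is listed first.
        rules.put("KEYWORD", Pattern.compile("class|int|new|return|String"));
        rules.put("NUMBER", Pattern.compile("\\d+"));
        rules.put("IDENT", Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*"));
        rules.put("OTHER", Pattern.compile("[\\s\\S]")); // catch-all, one character
    }

    List<Token> tokenize(CharSequence text) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                // lookingAt() only accepts matches that start exactly at pos.
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                // Strictly greater: a later rule must be longer to replace an earlier one.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalStateException("No rule matches at offset " + pos);
            }
            tokens.add(new Token(bestType, pos, bestEnd));
            pos = bestEnd; // the next token starts where this one ended
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "String s = \"a // b\"; // c";
        for (Token t : new LongestMatchLexer().tokenize(source)) {
            System.out.println(t + " -> " + source.substring(t.start, t.end));
        }
    }
}
```

For the sample input in `main`, the `//` inside the quotes is never offered to the comment rule, because the string token that starts at the opening quote has already consumed it.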

The important part here is that the end position of token n and the start position of token n + 1 are the same. Thus, tokens never overlap, and every character gets exactly one color.
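A lighter-weight way to get the same no-overlap property, along the lines of the single-regex suggestion in the comments on the question, is to join the rules into one alternation and check which group matched. The sketch below is an assumption-laden illustration: `SinglePassHighlighter`, the `Painter` callback, the group layout, and the small keyword list are all invented for the example. It uses numbered rather than named groups, since named groups may not be available on older Android versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one combined pattern, one pass, one color per match.
class SinglePassHighlighter {

    // Group 1 = comment, 2 = string, 3 = keyword, 4 = number (assumed layout).
    private static final Pattern TOKEN = Pattern.compile(
            "(/\\*[\\s\\S]*?\\*/|//[^\\n]*)"
            + "|(\"(?:\\\\.|[^\"\\\\\\n])*\"|'(?:\\\\.|[^'\\\\\\n])*')"
            + "|\\b(class|int|new|return|String)\\b"
            + "|\\b(\\d+)\\b");

    private static final String[] KINDS = {"comment", "string", "keyword", "number"};

    /** Callback that receives each match; illustrative, not an Android API. */
    interface Painter {
        void paint(String kind, int start, int end);
    }

    static void highlight(CharSequence text, Painter painter) {
        Matcher m = TOKEN.matcher(text);
        // find() resumes after the previous match, so matches cannot overlap:
        // a "//" inside a string is consumed by the string alternative, and a
        // quote inside a comment is consumed by the comment alternative.
        while (m.find()) {
            for (int g = 1; g <= KINDS.length; g++) {
                if (m.group(g) != null) {
                    painter.paint(KINDS[g - 1], m.start(), m.end());
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        String source = "String s = \"a // b\"; // say \"hi\"\nint n = 42;";
        highlight(source, (kind, start, end) ->
                System.out.println(kind + " " + start + "-" + end
                        + ": " + source.substring(start, end)));
    }
}
```

In an Android EditText, the `Painter` would presumably apply a `ForegroundColorSpan` to the given range of the `Editable` via `setSpan`. Because whichever token starts first wins, this also covers the reverse case raised in the comments, where strings inside comments were being highlighted.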
