Non greedy condition to ignore Comments in a line using QRegExp

Question

I would like a QRegExp that extracts all integers from a line but skips any digits that sit inside a comment section.

For Example

    { 20,100,0X0},/*this line contains 2 integers*/

My code

QRegExp("(\\d+)\\}");

does the job for that line, but it is not robust, since comments can also appear inside the curly braces.

For example, my expression will not work for:

    { 20,100/*new comment 2*/,0X0}

So how do I ignore the text inside a comment section using QRegExp and continue searching with my expression?

Wiktor Stribiżew · Accepted Answer · 2016-08-04T13:55:34.647

0

I suggest matching all the multiline comments as the first alternative in the regex, and matching and capturing the digit sequences as the second alternative (i.e. put a capturing group around the [0-9]+ pattern):

QRegExp("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|\\b([0-9]+)\\b")

Now, the digits you need will be in cap(1).


It also looks like you need to use word boundaries around the [0-9]+ pattern to match standalone, "whole-word" digit chunks.

Pattern details:

/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/ - an unrolled PCRE /\*.*?\*/ regex matching multiline C comments, see Mastering Regular Expressions book, Unrolling-The-Loop Components for C Comments section
| - or
\\b - a leading word boundary
([0-9]+) - Group 1 capturing one or more digits
\\b - trailing word boundary
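
The skip-and-capture idea above can be sketched as follows. This is a minimal illustration, not the answer's own code: it uses `std::regex` (ECMAScript grammar) instead of QRegExp so that it compiles without Qt, and the helper name `extractIntegers` is hypothetical. The pattern is the same one, written in a raw string literal so the backslashes are not doubled.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the standalone integers found in `line`,
// ignoring anything that sits inside a /* ... */ comment.
std::vector<std::string> extractIntegers(const std::string& line) {
    // Alternative 1 consumes a whole C comment (nothing is captured).
    // Alternative 2 captures a whole-word digit run in group 1.
    static const std::regex rx(
        R"re(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|\b([0-9]+)\b)re");

    std::vector<std::string> result;
    for (std::sregex_iterator it(line.begin(), line.end(), rx), end;
         it != end; ++it) {
        // Group 1 only participates when the digit alternative matched;
        // for a comment match it is unset and the match is skipped.
        if ((*it)[1].matched) {
            result.push_back((*it)[1].str());
        }
    }
    return result;
}
```

With QRegExp the equivalent loop would presumably call `indexIn` repeatedly, advance by `matchedLength()`, and keep only the matches where `cap(1)` is non-empty.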

edited Aug 04 '16 at 13:55

answered Aug 04 '16 at 13:45


So if I need to extract only text, for example some macro in the code, I would do QRegExp("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|\\b([A-Z_]+)\\b")? – Sivaramakrishna Shriraam Aug 04 '16 at 13:54
The "trick" is explained here - [*The Best Regex Trick Ever*](http://www.rexegg.com/regex-best-trick.html#thetrick). Match what you need to skip and match *and capture* with `(...)` what you need. – Wiktor Stribiżew Aug 04 '16 at 13:56
Now the problem is that when I have something like /****/*2 comment*/, the digit will be extracted, but it is still a commented section, don't you think? – Sivaramakrishna Shriraam Aug 04 '16 at 14:28
It is not that evident since the `/****/` is a comment with 2 `*`s, but then you have `*2 comment*/` and it is not a comment already. Or does it mean there can be any `/*`s up to the first `*/`? Is this `/****/*2 comment*/` *one* comment? – Wiktor Stribiżew Aug 04 '16 at 14:33
No, you were right; I am just wasting your time and mine too. The answer is fine. – Sivaramakrishna Shriraam Aug 04 '16 at 14:40
Could you help me solve this http://stackoverflow.com/questions/39152761/qregexp-to-extract-string-between-a-tag-in-html – Sivaramakrishna Shriraam Aug 26 '16 at 09:52
Are you just looking for `QRegExp("\\bPg_\\w+")`? – Wiktor Stribiżew Aug 26 '16 at 09:55
It works in regex but fails in QRegExp. Why would that happen? – Sivaramakrishna Shriraam Aug 26 '16 at 10:02

Lucero · Answer 2 · 2016-08-04T13:58:59.353

0

You will need to find the comment sections separately to do this reliably, unless the regex engine supports full regex in negative lookbehind (which - according to http://www.regular-expressions.info/ - only the .NET and JGsoft engines do).

The first pass removes or skips the comment sections in your string, then you do the number matching as you like (e.g. like now).

To find comments, you can use this pattern:

/\*((?!\*/).)*\*/

If you need to deal with nested comment sections, you need to remove the comments and repeat until no more comment sections are found.

On the other hand, if nested comments are not a requirement, you can combine the comment and digit matching regexes into one and then check the matched string (or captures) to find out if it was a comment or a digit match.
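
The two-pass approach described above might look roughly like this. It is a sketch under stated assumptions: it uses `std::regex` rather than QRegExp so it builds without Qt, it handles one line at a time (the `.` in this grammar does not match a newline), it does not handle nested comments, and the helper names `stripComments` and `integersOutsideComments` are made up for the illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, pass 1: blank out every /* ... */ comment.
std::string stripComments(const std::string& line) {
    // Same pattern as above: consume characters until "*/" is next.
    static const std::regex comment(R"re(/\*((?!\*/).)*\*/)re");
    // Replace with a space so the tokens on either side do not merge.
    return std::regex_replace(line, comment, " ");
}

// Hypothetical helper, pass 2: collect whole-word digit runs.
std::vector<std::string> integersOutsideComments(const std::string& line) {
    static const std::regex digits(R"re(\b[0-9]+\b)re");
    const std::string cleaned = stripComments(line);

    std::vector<std::string> result;
    for (std::sregex_iterator it(cleaned.begin(), cleaned.end(), digits), end;
         it != end; ++it) {
        result.push_back(it->str());
    }
    return result;
}
```

Keeping the passes separate means the number pattern can be changed freely without touching the comment handling, at the cost of building an intermediate string.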

edited Aug 04 '16 at 13:58

answered Aug 04 '16 at 13:45


The `\/\*.*?\*\/` is very inefficient due to the lazy dot matching pattern. It was unrolled in [*Mastering Regular Expressions* book, *Unrolling-The-Loop Components for C Comments* section](http://ww2.ii.uj.edu.pl/~tabor/prII09-10/perl/master.pdf), see the pattern in my answer. Also, QRegExp does not support lazy quantifiers, and there is no need to escape the `/` slash as it is not a special regex metacharacter. – Wiktor Stribiżew Aug 04 '16 at 13:52
@WiktorStribiżew Well, how efficient it is depends on the engine implementation only. This could be implemented using a DFA which has linear runtime. That being said, I have no idea of the performance characteristics of QRegExp, but I have just found out that it does not seem to support individual lazy quantifiers at all, so I'll update my answer to reflect this. – Lucero Aug 04 '16 at 13:56
Yes, there are several ways to unroll that pattern. – Wiktor Stribiżew Aug 04 '16 at 13:57
Technically it won't, but you can make it support them: check QRegExp's minimal setting (`setMinimal`), which can be set to true to get lazy quantifiers. – Sivaramakrishna Shriraam Aug 04 '16 at 14:01
