
I have a file containing many lines of text, and I want to match only those lines that contain all of a given set of words. Every word must be present in the line, but the words can appear in any order.

For example, if the words are one, two and three, the first two lines below should match:

three one four two <-- match
four two one three <-- match
one two four five
three three three

Can this be done using QRegExp (without splitting the text and testing each line separately for each word)?

sashoalm

2 Answers

Yes, it is possible. Use lookaheads. A lookahead checks the part of the subject string that follows the current position without actually consuming it. Once a lookahead has finished, the regex engine jumps back to where it started, so you can run another lookahead from the same position (in this case, from the beginning of the line). Try this:

^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$

The negated character classes [^\r\n] make sure that we can never look past the end of the line. Because the lookaheads don't actually consume anything for the match, we add the [^\r\n]* at the end (after the lookaheads) and $ for the end of the line. In fact, you could leave out the $, due to greediness of *, but I think it makes the meaning of the expression a bit more apparent.

Make sure to use this regex with multi-line mode (so that ^ and $ match the beginning of a line).
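As a minimal sketch of the lookahead idea, the snippet below applies the same pattern to one line at a time, so multi-line mode does not come into play. It uses C++'s std::regex (ECMAScript grammar) as a stand-in for QRegExp, which is an assumption made to keep the example self-contained; the function name containsAllWords is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `line` contains all of "one", "two"
// and "three", in any order. Uses std::regex instead of QRegExp (assumption).
// Each lookahead scans forward from the start of the line without consuming
// anything, so the three checks are independent of each other.
bool containsAllWords(const std::string& line) {
    static const std::regex pattern(
        R"re(^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$)re");
    return std::regex_search(line, pattern);
}
```

For the four sample lines in the question, containsAllWords returns true for the first two and false for the last two. Note that the pattern matches substrings, so "one" would also be found inside a longer word such as "someone"; wrapping each word in \b word boundaries is one way to tighten that if whole words are required.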

EDIT:

Sorry, QRegExp apparently does not support multi-line mode (m). The documentation says:

QRegExp does not have an equivalent to Perl's /m option, but this can be emulated in various ways for example by splitting the input into lines or by looping with a regexp that searches for newlines.

It even recommends splitting the string into lines, which is what you want to avoid.

Since QRegExp also does not support lookbehinds (which would help emulate m), other solutions are a bit trickier. You could go with

(?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*)

Then the line you want should be in capturing group 1. But I think splitting the string into lines might make for more readable code than this.
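The capturing-group variant can be sketched as follows, scanning a whole multi-line buffer in one pass. Again this uses std::regex as a stand-in for QRegExp (an assumption, not the asker's actual setup), and matchingLines is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every line of `text` that contains "one",
// "two" and "three". Uses std::regex instead of QRegExp (assumption).
// (?:^|\r|\n) lets a match begin at the start of the text or on a line-break
// character; capture group 1 then holds the line without its terminator.
std::vector<std::string> matchingLines(const std::string& text) {
    static const std::regex pattern(
        R"re((?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*))re");
    std::vector<std::string> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern); it != end; ++it) {
        result.push_back((*it)[1].str());  // group 1 is the matched line
    }
    return result;
}
```

Because the leading group consumes the line break itself, the overall match (group 0) usually starts with \r or \n; only group 1 should be used as the result.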

Martin Ender
  • Thanks. Now that I think about it, it probably wouldn't matter that much for performance. Even now, most of the time seems to be spent reading the text from the hard drive. The file is rather big, and I was trying to think of ways to improve performance. – sashoalm Dec 03 '12 at 20:42
  • @satuon before trying to improve performance, you should **always** profile to find the bottleneck. Otherwise you risk a) spending unnecessary time and b) impairing the readability of your code just to optimize a portion of your program that doesn't even make a difference. But at least now you know how to assert multiple patterns regardless of order with regular expressions ;) – Martin Ender Dec 03 '12 at 20:43
  • Yes, now that I think about it, maybe I can just spawn grep. Does it have a multi-line mode? – sashoalm Dec 03 '12 at 20:45
  • @satuon not sure, but I believe grep might work in multiline mode by default (it would make sense) – Martin Ender Dec 03 '12 at 20:46

With Qt 5's QRegularExpression, you can pass the MultilineOption pattern option, which makes ^ and $ match at the beginning and end of each line:

QRegularExpression("\\w+", QRegularExpression::MultilineOption)
Iulian Onofrei