
I am having a tough time figuring out a regular expression (something I sadly have almost no experience with) for the following problem:

  • text starting with a given prefix (let's say it's ab4)
  • text has a body of 4 blocks of 4 characters (that's what the 4 in ab4 stands for) each of which can be an ASCII alpha-numeric, whitespace, brackets, hyphen or a dot (basically a-zA-Z0-9 ()-.). Example: abcd, .b a, , b(a.) are all valid single blocks.
  • text body can be empty (ab4 is the only content) or contain up to the four blocks (ab4xxxx, ab4xxxxxxxx, ab4xxxxxxxxxxxx, ab4xxxxxxxxxxxxxxxx with x being a valid character)
  • text ends with a CRLF (carriage return + line feed, \r\n). The ending is counted as a terminating sequence and is NOT part of the body

So far I have come up with

.*ab4([a-zA-Z0-9 ()-.]{4}){1,4}\\r\\n.*

I use regular expressions 101 to verify my regex before I add it to my C++ code. However if I input

ab4aaa bbb ccc ddd \r\n 

I get the following stats:

  • Full match:

    0-25 'ab4aaa bbb ccc ddd \r\n'

  • Group 1.:

    15-19 'ddd '

The regex verifier tells me that

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data

but frankly I have no idea what this means. I tried (([a-zA-Z0-9 ()-.]{4}){1,4}) which didn't change much.
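For reference, the warning can be reproduced outside the verifier. The following is a minimal sketch that uses `std::regex` instead of `QRegularExpression` (an assumption made so that it compiles without Qt); the helper name `outerAndLast` is hypothetical, and the hyphen is placed at the end of the character class so that it is read as a literal. With an extra group around the repetition, group 1 holds the whole body while the inner, repeated group 2 still keeps only its last iteration:

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {group 1, group 2} for a pattern that has an
// outer capturing group around the repeated inner group.
std::pair<std::string, std::string> outerAndLast(const std::string& line) {
    static const std::regex re(R"(ab4(([a-zA-Z0-9 ().-]{4}){1,4})\r\n)");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return {"", ""};  // no match: both groups empty
    }
    // m[1] is the whole repeated body, m[2] only the last 4-char iteration.
    return {m[1].str(), m[2].str()};
}
```

Calling `outerAndLast("ab4aaa bbb ccc ddd \r\n")` should give `"aaa bbb ccc ddd "` for the outer group and `"ddd "` for the inner one, which is why adding the outer parentheses does not produce four separate groups.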

I'm looking for a better grouping namely one that sets the 4 blocks apart as separate groups. For the example above I'm expecting

  • Full match:

    0-25 'ab4aaa bbb ccc ddd \r\n'

    • Group 1.:

    0-3 'aaa '

    • Group 2.:

    4-7 'bbb '

    • Group 3.:

    8-11 'ccc '

    • Group 4.:

    12-15 'ddd '

rbaleksandar
  • What is the regex library you are using? `std::regex`? Just to clarify: in *every* regex, there are as many *groups* in the resulting match object as there are *capturing groups* inside the pattern. That number is *constant*. What you might use is the *capture* collection. However, there are only 3 regex engines supporting that feature. – Wiktor Stribiżew Nov 07 '17 at 08:36
  • I'm using the `QRegularExpression` class that is shipped with Qt. So far I know for sure it supports groups, which can be returned through the `QList QRegularExpression::capturedTexts()` function with the first capture always being the full match and the subsequent captures being the single groups. – rbaleksandar Nov 07 '17 at 08:40
  • Ok, that means you are using PCRE that does not support a capture stack for each group, so you will have to use a 2-step approach: 1) extract whole matches capturing the part you will need to process further, and 2) a smaller regex that will match multiple occurrences of the necessary pattern inside the captured data in each match. The first one will be `ab4((?:[a-zA-Z0-9 ().-]{4}){1,4})\\r\\n` (note the hyphen is at the end) and the second is `[a-zA-Z0-9 ().-]{4}` or even `.{4}` or check if there are other ways to split a string into substrings of 4-char strings in Qt. – Wiktor Stribiżew Nov 07 '17 at 08:42
  • So basically I can iterate through the whole string, apply my regex and if one is found (at the end) I chop it off and then repeat until there are no more matches left? – rbaleksandar Nov 07 '17 at 08:45
  • You iterate to find all matches, and each time the match is found, grab `captured(1)` and [split it into substrings of length 4](https://stackoverflow.com/questions/16709314/split-string-using-loop-to-specific-length-sub-units). – Wiktor Stribiżew Nov 07 '17 at 08:50

1 Answer

You are using the PCRE regex engine (through QRegularExpression), which does not support a capture stack for each group, so you will have to use a 2-step approach:

  • Extract whole matches capturing the part you will need to process further, and
  • Split each capture into 4-char parts.

The first extracting regex will be

ab4((?:[a-zA-Z0-9 ().-]{4}){1,4})\\r\\n
   ^                 ^          ^

Note I added capturing parentheses round the part you are interested in, and the hyphen is at the end of the character class.

Use the pattern to extract all matches from the text.

Then split the match.captured(1) into substrings of length 4. You do not really need to use a regex for this step since the string is already pre-validated during the first regex step.
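The two steps can be sketched as follows. This is only an illustration under stated assumptions: it uses `std::regex` rather than `QRegularExpression` so that it is self-contained, and the function names `splitFixed` and `extractBlocks` are hypothetical. With Qt, `globalMatch()` and `captured(1)` would play the roles of the iterator and `(*it)[1]`.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Step 2: cut an already validated body into fixed-width chunks.
std::vector<std::string> splitFixed(const std::string& body, std::size_t width) {
    assert(width > 0);
    std::vector<std::string> chunks;
    for (std::size_t i = 0; i + width <= body.size(); i += width) {
        chunks.push_back(body.substr(i, width));
    }
    return chunks;
}

// Steps 1 and 2 together: find every "ab4<body>\r\n" frame in the text and
// return the 4-character blocks of each frame's body.
std::vector<std::vector<std::string>> extractBlocks(const std::string& text) {
    // Group 1 captures the whole body; the inner group is non-capturing.
    static const std::regex frame(R"(ab4((?:[a-zA-Z0-9 ().-]{4}){1,4})\r\n)");
    std::vector<std::vector<std::string>> result;
    for (std::sregex_iterator it(text.begin(), text.end(), frame), end;
         it != end; ++it) {
        result.push_back(splitFixed((*it)[1].str(), 4));
    }
    return result;
}
```

For the sample line `ab4aaa bbb ccc ddd \r\n` this should yield one frame with the four blocks `aaa `, `bbb `, `ccc ` and `ddd `. The pattern as written requires at least one block; covering a bare `ab4\r\n` would need the body group to be optional.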

Wiktor Stribiżew
  • It works. The only thing I feel the need to point out is that you can use `QStringRef` to get the smaller 4-character-long chunks instead of using `std::string::substr`. It makes the code more readable, since you don't need to convert to `std::string` and back. I also added a `?` after the last parenthesis and before the `\\r`, since otherwise the case `ab4\r\n` will not be covered. – rbaleksandar Nov 07 '17 at 11:25
  • Good. Also, you may just use `\\R` to match any line break sequence. – Wiktor Stribiżew Nov 07 '17 at 11:27
  • Hmmm, actually it doesn't work. -_- Try putting 8 or 16 (perhaps higher numbers are also possible) instead of the 4 inside the `{4}` and test it on the `ab4aaa bbb ccc ddd\r\n` string. It matches every time. To make sure it's not my extra `?`, I removed it, but the problem still exists. – rbaleksandar Nov 07 '17 at 11:36
  • Btw the `\\R` is not necessary since the strings ALWAYS end with a CR. – rbaleksandar Nov 07 '17 at 11:38
  • The `ab4aaa bbb ccc ddd` with a space at the end? [`{8,16}` does not match that string](https://regex101.com/r/VVIcSc/1) – Wiktor Stribiżew Nov 07 '17 at 11:48
  • There is no space at the end (unless the last block contains one). I mean if you change [`ab4((?:[a-zA-Z0-9 ().-]{4}){1,4})\\r\\n`](https://regex101.com/r/KvqW1B/4) to [`ab4((?:[a-zA-Z0-9 ().-]{8}){1,4})\\r\\n`](https://regex101.com/r/mLg5OL/1). The number of blocks isn't changed. – rbaleksandar Nov 07 '17 at 11:56
  • So, you mean the whitespace is not to be counted in as a field char? I am sorry, but that part of your question is not really covered in the post, i.e. the pattern requirements are not provided. Just now, I can suggest [`ab4([a-zA-Z0-9().-]{3}(?:\s+[a-zA-Z0-9().-]{3}){0,3})?\s*\\r\\n`](https://regex101.com/r/mLg5OL/2), but I am not really sure it will work in all scenarios (why `.` and `-` are there? Can they be used like whitespace?) You should redefine what block is in your question. – Wiktor Stribiżew Nov 07 '17 at 12:24
  • Whitespace is counted as a field char. The `.` and `-` are there since each block may contain those. I'm just wondering why I get the same result for the whole string no matter if I select the length of a single block to be 4, 8 or 16. – rbaleksandar Nov 07 '17 at 13:10
  • @rbaleksandar It is easy to explain - compare the [{4}](https://regex101.com/r/9BLx9Y/1) version against the [{8}](https://regex101.com/r/9BLx9Y/2) version. See the red highlight? It still matches the string, just the single block length is different. – Wiktor Stribiżew Nov 07 '17 at 13:13
  • *facepalm* Indeed. Also because of the `{1,4}` condition if we have 8 characters for a block size of 4 there will be 2 blocks but for a size of 8 there will be only 1. – rbaleksandar Nov 07 '17 at 13:20
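
The effect discussed in the last few comments can be checked with a small sketch. It assumes `std::regex` in place of `QRegularExpression`, and the helper name `countBlocks` is hypothetical. It builds the answer's pattern for a given block width and reports how many blocks the captured body splits into:

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the number of blocks of `width` characters in
// the body, or -1 when the line does not match the pattern at all.
int countBlocks(const std::string& line, std::size_t width) {
    assert(width > 0);
    const std::regex frame("ab4((?:[a-zA-Z0-9 ().-]{" + std::to_string(width) +
                           "}){1,4})\\r\\n");
    std::smatch m;
    if (!std::regex_search(line, m, frame)) {
        return -1;
    }
    // The body length is always a multiple of `width` when the pattern matches.
    return static_cast<int>(static_cast<std::size_t>(m[1].length()) / width);
}
```

For the 16-character body in `ab4aaa bbb ccc ddd \r\n`, widths 4, 8 and 16 should all match and give 4, 2 and 1 blocks respectively, while a width such as 5 should not match because no count of 1 to 4 blocks adds up to 16 characters.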