1

I'm looping through a string with preg_match_all, using a number of different patterns. Some of these patterns look a lot like each other and differ only slightly.

Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.


The problem I'm running into is that pattern A also matches parts of the strings meant for pattern B:

A:

(\d{4})(A|B)?(C|D)?

... matches 1234, 1234A, 1234AD, etc.

B:

I also have another pattern:

T(\d{4})\/(\d{4})

... which matches strings like: T7878/6767

The result

When running preg_match_all on "T7878/6767 1234AD", pattern A gives the following matches:

7878, 6767, 1234AD
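
A minimal reproduction of the result above, as a sketch: the `~` delimiters and the variable names are assumptions, since the question only shows the bare patterns.

```php
<?php
// Pattern A from the question; the ~ delimiters are an assumption.
$patternA = '~(\d{4})(A|B)?(C|D)?~';
$subject  = 'T7878/6767 1234AD';

preg_match_all($patternA, $subject, $matches);

// $matches[0] holds the full matches. The two digit groups inside
// the T-code are picked up as well, which is the unwanted behaviour.
print_r($matches[0]);
```

Here `$matches[0]` should contain 7878, 6767 and 1234AD, because nothing in pattern A looks at the characters surrounding the 4 digits.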

Does anyone have a suggestion for how to prevent A from matching parts of B in a string like "Some text T7878/6767 1234AD and some more text"?

Your help is greatly appreciated!

Fluup

3 Answers

1

Scenario with boundaries

If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.

If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).

So, in this first case, you may use

(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)

and

(?<!\S)T(\d{4})\/(\d{4})(?!\S)

See Pattern 1 demo and Pattern 2 demo.
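
As a rough sketch of how the two bounded patterns behave on the sample sentence from the question (the `~` delimiters and variable names are assumptions made for this example):

```php
<?php
$subject = 'Some text T7878/6767 1234AD and some more text';

// Pattern A with a whitespace boundary required on both sides.
$patternA = '~(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)~';
// Pattern B with the same boundaries.
$patternB = '~(?<!\S)T(\d{4})\/(\d{4})(?!\S)~';

preg_match_all($patternA, $subject, $a);
preg_match_all($patternB, $subject, $b);

print_r($a[0]); // full matches of pattern A
print_r($b[0]); // full matches of pattern B
```

Pattern A should now only return 1234AD: the digits in T7878/6767 are preceded by `T` and `/`, which are non-whitespace characters, so the `(?<!\S)` lookbehind rejects them. Pattern B should return T7878/6767.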

Scenario with no specific boundaries

You need to make sure the text that the second pattern matches is skipped when you parse the string with the first one. Use the SKIP-FAIL technique for this:

'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'

See the regex demo.

If you do not need the capturing groups, you may simplify it to

'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'

See another demo

Details

  • T\d{4}/\d{4} - T followed by 4 digits, / and another 4 digits
  • (*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
  • | - or
  • \d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
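
A short sketch of the SKIP-FAIL pattern with capturing groups, run against the sample sentence from the question. The variable names and the use of `PREG_SET_ORDER` are choices made for this example only.

```php
<?php
$subject = 'Some text T7878/6767 1234AD and some more text';

// The first alternative consumes the T-code and then fails on purpose;
// (*SKIP) makes the engine resume searching after the consumed text.
$pattern = '~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

print_r($matches);
```

This should yield a single match, 1234AD, with the groups 1234, A and D. Note that no whitespace boundary is required here, so 4 digits glued to other text would still match.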
Wiktor Stribiżew
  • Nice solution (never heard of the backtracking control verbs). Why is this way preferable above a lookbehind on the `T and /` characters? – Niellles Nov 23 '17 at 23:28
  • @user2693053 With a lookbehind like `(?<!T)`, the first regex will stop matching the expected strings like `T1234AD`. – Wiktor Stribiżew Nov 23 '17 at 23:30
  • I guess you're correct, although we weren't really expecting that combo: "_Some text T7878/6767 1234AD and some more text_". But then again, always expect the unexpected. – Niellles Nov 23 '17 at 23:32
  • The SKIP, FAIL method seems to work very well. However, I will be running almost 50 different regex patterns, which would make this hard to maintain. I also need different capture groups for every match. I'm basically decoding a string of codes, all with their own meaning. – Fluup Nov 24 '17 at 11:29
  • @Fluup I doubt you can find a better solution. At least for the problem currently explained in the question. – Wiktor Stribiżew Nov 24 '17 at 11:37
  • Thanks @WiktorStribiżew Also, since all blocks that should be matched are separated by a space (e.g. T7878/6767 1234AD F1887V36/D 133RT), would a look ahead/behind solve my problem? I only just found out about this, and it seems to work, but I'm not sure if this is a proven way to go: `(?<!\S)T(\d{4})\/(\d{4})(?!\S)` and `(?<!\S)(\d{4})(A|B)?(C|D)?(?!\S)` – Fluup Nov 24 '17 at 11:40
  • @Fluup If your matches have boundaries, that should be added to the question. If you expect a whitespace boundary before each match, then the solution is easy with the `(?<!\S)` negative lookbehind. If you expect a whitespace boundary at the end of the match, add the `(?!\S)` negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way. Also, please don't use `(A|B)?`, write it as `[AB]?` – Wiktor Stribiżew Nov 24 '17 at 11:44
  • Awesome, I learned a lot @WiktorStribiżew. I appreciate your help. This combination will probably suit my needs. Thumbs up! – Fluup Nov 24 '17 at 11:47
0

From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]

Here's your new regex:

T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)

For the string input:

T7878/6767 1234AB

We get the groups:

Match 1
Full match   0-17    `T7878/6767 1234AB`
Group 1      1-5     `7878`
Group 2      6-10    `6767`
Group 3      11-17   `1234AB`
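
The same check can be sketched in PHP. The `~` delimiters and variable names are assumptions; the pattern itself is the one given above.

```php
<?php
// Combined pattern from this answer: the T-code, a space, then the 4-digit code.
$pattern = '~T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)~';
$subject = 'T7878/6767 1234AB';

// preg_match stops at the first match, which is all this input contains.
if (preg_match($pattern, $subject, $m)) {
    print_r($m); // $m[0] is the full match, $m[1]..$m[3] are the groups
}
```

Note that this only matches when the two codes appear next to each other, separated by exactly one space.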

Regex101

Neil
0

Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are screwing things up. You only need two groups which match your string syntax exactly.

The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and another word boundary.

The second group matches 4 digits and then letters A-D between 0 and 2 times. It has a negative lookbehind, so it will only match if the 4 digits are preceded by a whitespace character or the start of the string.

(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})

Preg match all output:

Array
(
    [0] => Array
        (
            [0] => T7878/6767
            [1] => 1234AB
        )

    [1] => Array
        (
            [0] => T7878/6767
            [1] => 
        )

    [2] => Array
        (
            [0] => 
            [1] => 1234AB
        )

)
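
For reference, a call along these lines should produce the output shown above. The input string `T7878/6767 1234AB` is inferred from the output, and the `~` delimiters are an assumption.

```php
<?php
// Two alternatives: group 1 for the T-code, group 2 for the 4-digit code.
$pattern = '~(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})~';
$subject = 'T7878/6767 1234AB';

// Default PREG_PATTERN_ORDER: $matches[0] holds the full matches,
// $matches[1] and $matches[2] hold each group (empty string when unused).
preg_match_all($pattern, $subject, $matches);

print_r($matches);
```

Checking which of `$matches[1][$i]` or `$matches[2][$i]` is non-empty tells you which alternative produced match `$i`.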
miknik
  • Thank you for the suggestion, @miknik. The look ahead and look behind operator in combination with SKIP-FAIL will probably suit my needs! – Fluup Nov 24 '17 at 11:50