
I have two alternatives in my regex (used in PHP):

(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))

When I test the 1st alternative with the following input, I get four match groups: 1, 2, 3 and 4.

BIOLOGIQUES                                                                                          47     131002 / 4302

Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6

But when I test the second alternative, the matched groups are 5, 6, 7 and 8:

   Dossier N°       :     47     131002 / 4302

The second condition here : http://www.rubular.com/r/eYzBJq1rIW

Is there a way to always get match groups 1, 2, 3 and 4 for the second alternative too?

George Cummins
amorino
  • You have them both in the same regex. Why not separate them? – Oct 02 '13 at 21:49
  • If this is the behavior you want, then your regex should read more like "`BIOLOGIQUES` or `Dossier N° : ` followed by the groups of digits". IOW, the "or" condition is only necessary for the first component of the regexp. – quietmint Oct 02 '13 at 21:49

2 Answers


Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:

preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);

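Use the u modifier so that the pattern and the subject are treated as UTF-8. Without it, . matches a single byte, and ° is two bytes in UTF-8, so N.\s+ cannot match.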
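As a minimal sketch of how this pattern behaves on the two sample lines from the question: the helper name `extract_codes` is hypothetical, and the source file is assumed to be saved as UTF-8 so that `°` is a valid UTF-8 sequence.

```php
<?php
// Hypothetical helper: returns the three numbers from a line, or null if the line does not match.
function extract_codes(string $line): ?array
{
    // Alternation only around the prefix; the number groups are shared by both prefixes.
    $pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';
    if (preg_match($pattern, $line, $match) !== 1) {
        return null;
    }
    // $match[1] holds the whole matched text; $match[2..4] hold the numbers.
    return [$match[2], $match[3], $match[4]];
}

// Sample inputs adapted from the question (runs of spaces shortened).
$first  = extract_codes('BIOLOGIQUES      47     131002 / 4302');
$second = extract_codes('   Dossier N°       :     47     131002 / 4302');
print_r($first);
print_r($second);
```

Both calls should yield the same three strings, since the group numbers no longer depend on which prefix matched.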

Barmar
  • Hello, on http://www.rubular.com/ it worked fine, but in PHP it behaves strangely. With the 1st condition I get `[0] => BIOLOGIQUES 47 131002 / 4302 [1] => BIOLOGIQUES 47 131002 / 4302 [2] => 47 [3] => 131002 [4] => 4302`; with the second condition I get nothing. PHP: `preg_match("/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/", $content, $codes2); print_r($codes2);` – amorino Oct 02 '13 at 22:04
  • PHP seems to have a problem with that special character after `N`, it's not treating it as a single character. If I replace it with an ordinary ASCII character, it works. – Barmar Oct 02 '13 at 22:11
  • Hello, you are right, /u resolved the problem. One last thing: in PHP I now get `[0] => Dossier N° : 47 131002 / 4302 [1] => Dossier N° : 47 131002 / 4302 [2] => 47 [3] => 131002 [4] => 4302`. Could you help me avoid the duplicate in group 1, please? Thank you – amorino Oct 02 '13 at 22:21
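
One possible way to address the duplicate raised in the last comment, offered as a sketch and not as the answerer's own reply: group 1 repeats `$match[0]` because the outer parentheses wrap the entire pattern, so dropping them removes the duplicate. Note that this shifts the numbers to groups 1 to 3. The function name `extract_without_duplicate` is hypothetical.

```php
<?php
// Hypothetical helper: same pattern as above, minus the outer capturing parentheses.
function extract_without_duplicate(string $line): ?array
{
    $pattern = '/(?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+)/u';
    if (preg_match($pattern, $line, $match) !== 1) {
        return null;
    }
    // $match[0] is the whole match; $match[1..3] are the three numbers.
    return $match;
}

$codes = extract_without_duplicate('Dossier N° : 47 131002 / 4302');
print_r($codes);
```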

I assume your regex is compressed. If the dot is meant to be a literal abbreviation dot (as after a middle initial), it should be escaped. The suggestion below factors out the common part, as Barmar's does. If you don't want to capture the different names, remove the parentheses around them.

Sorry, it looks like you intend it to be a dot metachar. Just remove the \ from it.

 # (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))

 (?:
      ( BIOLOGIQUES )                 # (1)
   |  ( Dossier\ N \. \s+ : )         # (2)
 )
 \s+ 
 (                               # (3 start)
      ( \d+ )                         # (4)
      \s+ 
      ( \d+ )                         # (5)
      \s+ \/ \s+ 
      ( \d+ )                         # (6)
 )                               # (3 end)

Edit: the regex should be factored, but if the alternatives become too different to factor, a way to reuse the same capture group numbers is a branch reset group, (?|...).
Here is your original code with some annotations using branch reset.

 (?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))

      (?|
 br 1      (                               # (1 start)
                BIOLOGIQUES \s+ 
      2         ( \d+ )                         # (2)
                \s+ 
      3         ( \d+ )                         # (3)
                \s+ \/ \s+ 
      4         ( \d+ )                         # (4)
    1      )                               # (1 end)
        |  
 br 1      (                               # (1 start)
                Dossier\ N . \s+ : \s+ 
      2         ( \d+ )                         # (2)
                \s+ 
      3         ( \d+ )                         # (3)
                \s+ \/ \s+ 
      4         ( \d+ )                         # (4)
    1      )                               # (1 end)
      )
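
The branch reset version can be tried in PHP roughly as follows. This is a sketch: the function name `match_branch_reset` is hypothetical, the `u` modifier is carried over from the other answer so that `.` can match `°`, and the space is written literally because the `x` flag is not used here.

```php
<?php
// Hypothetical helper: runs the branch reset pattern and returns the match array, or null.
function match_branch_reset(string $line): ?array
{
    // Inside (?| ... ) each alternative restarts group numbering at 1,
    // so both branches fill groups 1 to 4.
    $pattern = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))'
             . '|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';
    return preg_match($pattern, $line, $match) === 1 ? $match : null;
}

$a = match_branch_reset('BIOLOGIQUES   47   131002 / 4302');
$b = match_branch_reset('Dossier N°   :   47   131002 / 4302');
print_r($a);
print_r($b);
```

Whichever branch matches, the numbers land in groups 2, 3 and 4, with group 1 holding the whole matched text.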

Or, you could factor it AND use branch reset.

 # (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))

      (?|
 br 1      ( BIOLOGIQUES \s+ )             # (1)
        |  
 br 1      ( Dossier\ N . \s+ : \s+ )      # (1)
      )
      (?:
 2         ( \d+ )                         # (2)
           \s+ 
 3         ( \d+ )                         # (3)
           \s+ \/ \s+ 
 4         ( \d+ )                         # (4)
      )
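
For completeness, the factored branch reset pattern can also be kept in its expanded form in PHP by adding the `x` modifier, which ignores unescaped whitespace and allows `#` comments inside the pattern. This is a sketch under the same UTF-8 assumption as before; `match_factored` is a hypothetical name.

```php
<?php
// Hypothetical helper: factored prefix plus branch reset, written in free-spacing (x) mode.
function match_factored(string $line): ?array
{
    $pattern = '/
        (?|
            ( BIOLOGIQUES \s+ )             # group 1, first branch
          | ( Dossier\ N . \s+ : \s+ )      # group 1, second branch
        )
        ( \d+ ) \s+                         # group 2
        ( \d+ ) \s+ \/ \s+                  # group 3
        ( \d+ )                             # group 4
    /xu';
    return preg_match($pattern, $line, $match) === 1 ? $match : null;
}

$m1 = match_factored('BIOLOGIQUES   47   131002 / 4302');
$m2 = match_factored('Dossier N°   :   47   131002 / 4302');
print_r($m1);
print_r($m2);
```

Here group 1 captures only the prefix (including its trailing whitespace), so nothing duplicates the full match in `$match[0]`.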