
I'm sorry for the poor title, but it is a very generic question

I have to match this pattern

;AAAAAAA(BBBBBB,CCCCC,DDDDDD)
  • AAAAA = all characters starting from ";" to "(" (both ;( not included)
  • BBBBB = all characters starting from "(" to "," (both (, not included)
  • CCCCC = all characters starting from "," to "," (both ,, not included)
  • DDDDD = all characters starting from "," to ")" (both ,) not included)

The "all characters between x and y" problem is one that kills me every time

:(

I'm using PHP and I have to match all occurrences of this pattern (preg_match_all), which, sadly, can also span multiple lines.

Thank you in advance!

DarkAjax
skyline26
  • so it will always just be 3 elements inside the parentheses? – Martin Ender Nov 22 '12 at 22:03
  • How is it related to greediness? – zerkms Nov 22 '12 at 22:05
  • What regex have you tried? Did it match too much, or too little? * See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. – mario Nov 22 '12 at 22:05
  • @zerkms it is, if there are multiple occurrences and for `B` it is in any case. – Martin Ender Nov 22 '12 at 22:05
  • @m.buettner: look at the current only answer - there is nothing about greediness. No `U` modifier or `?` for making quantifiers ungreedy – zerkms Nov 22 '12 at 22:06
  • @zerkms yes but this is the advanced solution. if you don't think about this, greediness is what causes the problem, and the easy solution is to make things ungreedy. I just suggested an alternative. – Martin Ender Nov 22 '12 at 22:08
  • @m.buettner: these days the basic regex syntax with negated sets `[^,]` is considered to be advanced. Oh... What lookup assertions, possessive quantifiers and recursive expressions then, mega advanced? )) PS: it's funny that it is placed at "basic" section http://www.regular-expressions.info/reference.html ;-) – zerkms Nov 22 '12 at 22:10
  • @zerkms not the use of negated sets, but using negated sets to avoid ungreediness. the latter seems to be the go-to recommendation, while using negated character classes seems very unknown among newcomers to regex. – Martin Ender Nov 22 '12 at 22:12
  • @m.buettner: ok, fair enough ;-) – zerkms Nov 22 '12 at 22:13

2 Answers


I would recommend that you do not use an ungreedy quantifier, but instead make all repetitions mutually exclusive with their delimiters. What does this mean? It means, for instance, that A can be any character except (. This gives the following regex:

;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]

Where the last [)] is not even necessary.

The PHP code would then look like this:

preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $matches);
$fullMatches = $matches[0];
$arrayOfAs = $matches[1];
$arrayOfBs = $matches[2];
$arrayOfCs = $matches[3];
$arrayOfDs = $matches[4];
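
As a quick check of the pattern above, here is a minimal sketch run against a made-up input. The sample string and the use of PREG_SET_ORDER and trim() are assumptions for illustration, not part of the original question. It shows that the negated character classes also match line breaks, so a record spread over several lines is still captured.

```php
<?php
// Hypothetical sample input: two records, the second spanning several lines.
$input = ";first(a,b,c)\n;second(one,\n  two,\n  three)";

// Negated character classes match newlines too, so no /s modifier is needed.
$pattern = '/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/';

// PREG_SET_ORDER groups the results per match instead of per capture group.
$count = preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[0] is the full match; $m[1]..$m[4] are A, B, C and D.
    // trim() drops the line breaks and indentation captured with each part.
    printf("A=%s B=%s C=%s D=%s\n", $m[1], trim($m[2]), trim($m[3]), trim($m[4]));
}
```

Whether the surrounding whitespace should be trimmed depends on the data, so treat the trim() calls as optional.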

As the comments show, my escaping technique is a matter of taste. This regex is of course equivalent to:

;([^(]*)\(([^,]*),([^,]*),([^)]*)\)

But I think that looks a lot more mismatched/unbalanced than the other variant. Take your pick!

Finally, to the question of why this approach is better than using ungreedy (lazy) quantifiers. Here is some good, general reading. Basically, when you use ungreedy quantifiers, the engine still has to backtrack. It first tries the shortest possible repetition, then notices that the ( after it doesn't match. So it has to go back into the repetition and consume another character. But then the ( still doesn't match, so it goes back into the repetition again, and so on for every character. With this approach, however, the engine consumes as much as possible when it enters the repetition for the first time. Once all non-( characters are consumed, the engine can match the following ( right away.
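
To make the difference concrete, here is a small sketch comparing a plain greedy pattern, the lazy variant, and the negated-class variant on a made-up input containing two occurrences. The input string and variable names are assumptions for illustration. It only compares what gets matched and does not measure backtracking cost.

```php
<?php
// Hypothetical input with two occurrences on one line.
$input = ';x(1,2,3) ;y(4,5,6)';

// Greedy: the first .* runs to the end and backtracks only as far as needed,
// so both occurrences are swallowed into a single match.
$greedyCount = preg_match_all('/;(.*)\((.*),(.*),(.*)\)/s', $input, $greedy);

// Lazy: each .*? expands one character at a time until the delimiter matches.
$lazyCount = preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/s', $input, $lazy);

// Negated classes: each repetition stops at its delimiter by construction.
$negatedCount = preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $negated);

printf("greedy:  %d match(es), A = %s\n", $greedyCount, $greedy[1][0]);
printf("lazy:    %d match(es), A = %s\n", $lazyCount, implode(' | ', $lazy[1]));
printf("negated: %d match(es), A = %s\n", $negatedCount, implode(' | ', $negated[1]));
```

On this input the lazy and negated-class patterns capture the same values; they differ in how the engine gets there, not in the result.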

Martin Ender
  • `[(]`, `[)]` --- why do you create sets of one character? – zerkms Nov 22 '12 at 22:05
  • `[(]` is much more confusing than `\(`. – Gumbo Nov 22 '12 at 22:06
  • @zerkms escaping. Gumbo, I guess it's a matter of taste – Martin Ender Nov 22 '12 at 22:08
  • @m.buettner So what would you use to match a literal `[]`? Simply `\[]` or rather `[[][]]`? – Gumbo Nov 22 '12 at 22:15
  • @Gumbo I usually escape these, because these **are** a lot more confusing inside short character classes. But `[(]` makes a nice, self-contained square (visually) which is easy to recognize as a single thing in the regex. Whereas `\(` still leaves [some odd tension with me](http://xkcd.com/859/). Also the character-class escaping is more highlighting than the backslash. If you have something like `[.]` or `[+]` then the relevant character is still in the centre of the space it occupies. I find `\` much more cluttering in all cases except literal square brackets, really. – Martin Ender Nov 22 '12 at 22:18
  • thank you all for the answers, i had to modify something but problem resolved! `[^(;]*` i needed to include `;` in the first part of regex. accepting this as the most complete answer, but thanks to all! – skyline26 Nov 22 '12 at 22:35
  • @toPeerOrNotToPeer ah yeah, I guess if you have other `;` that are not right in front of a `(` then you need to do that. – Martin Ender Nov 22 '12 at 22:40
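
The modification discussed in the last two comments can be sketched as follows. The input string is made up, and the variable names $loose and $strict are hypothetical. The sketch assumes a stray ";" appears before the record of interest.

```php
<?php
// Hypothetical input: a stray ";" appears before the real record.
$input = '; note ;name(a,b,c)';

// Original pattern: A may contain ";", so the match starts at the stray one.
preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $loose);

// Modified pattern: A may contain neither "(" nor ";", so the match can
// only start at the ";" that directly precedes the name.
preg_match_all('/;([^(;]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $strict);

var_dump($loose[1][0]);  // A as captured by the original pattern
var_dump($strict[1][0]); // A as captured by the modified pattern
```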

You could use something like this code:

preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/s',$text,$matches);

See it on ideone.com.

Basically, you can use .*? (the question mark makes the quantifier ungreedy) and make sure to escape the literal parentheses. You will also need the s modifier (PCRE_DOTALL), which lets . match newlines, for the pattern to work across multiple lines.

The captured values end up in the $matches array.
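
The effect of the s modifier can be checked with a short sketch. The multi-line input below is an assumption for illustration. Without s, the dot does not match a newline, so a record that spans lines is not found.

```php
<?php
// Hypothetical record spread over three lines.
$text = ";second(one,\n  two,\n  three)";

// Without /s the dot stops at line breaks, so nothing matches here.
$withoutS = preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/', $text, $m1);

// With /s (PCRE_DOTALL) the dot also matches "\n" and the record is found.
$withS = preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/s', $text, $m2);

printf("without s: %d match(es)\n", $withoutS);
printf("with s:    %d match(es)\n", $withS);
```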

bozdoz