This is a very contrived example, but I've searched for things like "regex capture repetition match" and so forth with no luck.

The nearest I got is the question "How to get all captures of subgroup matches with preg_match_all()?".

Rather than an example, here's (sort of) my problem.

I have a tag in the form:

 name>>thing1(d1),thing2(d2),thing3(d3)::otherName

I want to extract the name, the things with their data (one argument at most), and the bit at the end, the otherName.

A rule to do this might look something like:

^([a-z]+)>>([a-z]+\([a-z]+\)(,[a-z]+\([a-z]+\))*)?::([a-zA-Z]+)$

(This rule won't actually work, I'm missing the numbers, but you should get a feel for the form.)

As you can see, I'm already matching my pattern here; what I want is to pull out the chunks matched by the repetition with the *.
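
To make the problem concrete, here is a minimal PHP sketch of what a single preg_match call hands back for a rule of this form. The rule is a variant of the one above: digits are added to the character classes and the parentheses are balanced, both of which are assumptions made so that the sample tag matches. The repeated group only reports its last iteration.

```php
<?php
// Sketch: a repeated capturing group keeps only its last iteration.
// Assumption: digits are allowed in the character classes so the sample tag matches.
$tag  = 'name>>thing1(d1),thing2(d2),thing3(d3)::otherName';
$rule = '/^([a-z]+)>>([a-z0-9]+\([a-z0-9]+\)(,[a-z0-9]+\([a-z0-9]+\))*)?::([a-zA-Z]+)$/';

$matched = preg_match($rule, $tag, $m);

// $m[2] holds the whole comma-separated chunk;
// $m[3] holds only the final repetition, not a list of repetitions.
print_r($m);
```

Here `$m[3]` ends up as `,thing3(d3)`, while `,thing2(d2)` is matched but not reported.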

In case it isn't clear since the edit

I am not having trouble matching my tags. I want to extract all parts of the tag in one step. So I want an array like:

 Array(`name`,Array(`thing1`,`d1`),Array(Array(`thing2`,`d2`),
 Array(`thing3`,`d3`)),`otherName`)

I do have a fallback

I want to do this in one expression, as I see no technical reason why it shouldn't be possible. However, as a "plan B" I can just extract the chunk between >> and :: and run preg_match_all on it. I pose this question because performance is at the back of my mind, and my rule already looks at the information; I just have to capture it. So I wouldn't say it's a premature optimisation.

Alec Teal
  • Thinking about it if you have repetitions within repetitions, knowing what came from where in the resulting array would require more information. I still wonder if there's a way though – Alec Teal Oct 15 '15 at 13:29
  • A repeated capturing group will only capture the last iteration. You will need to put a capturing group around the repeated group to capture all iterations. –  Oct 15 '15 at 13:36
  • @WashingtonGuedes if I changed the example to `(([a-z])+)` then I'd get `bc` as a match and `c`. Not an array with `b` and `c`, which is what I'm aiming for. As I mentioned my fallback is to capture the lot, then `preg_match_all` it. – Alec Teal Oct 15 '15 at 13:38
  • Why not `preg_match_all`? –  Oct 15 '15 at 13:39
  • @WashingtonGuedes Because there's only one thing to match here. If I preg-match-all it I'll get an array with 1 element of my output above. – Alec Teal Oct 15 '15 at 13:40
  • although i agree it's a dumb suggestion, there's some truth to what @stribizhev said. As far as I know, only perl and .net allow for capturing the repetition matches. – d0nut Oct 15 '15 at 13:40
  • @WashingtonGuedes to explain, the workaround is to capture the bit with repetitions THEN apply `preg_match_all` to that with the patterns I want to capture. So here I'd take the `bc` bit and `preg_match_all("([a-z])","bc")` to get an array with `b` and `c`. But there's no reason why one DFA shouldn't be able to do this, and it'd be simpler if I didn't have to (as I'd have to make sure the `preg_match_all` matched the lot, an extra step of validation, as well as parsing and running another regex) – Alec Teal Oct 15 '15 at 13:42
  • @iismathwizard I'm pretty sure that PHP and Perl use the same library for regex, they both have "Perl-only" things like recursion and the PCRE library actually stands for "perl compatible" - that's why I didn't think having a solution would be far fetched. – Alec Teal Oct 15 '15 at 13:43
  • @iismathwizard to add to that http://www.pcre.org/ this one. – Alec Teal Oct 15 '15 at 13:44
  • @WashingtonGuedes that would return `[0] => abc, [1] => bc` – d0nut Oct 15 '15 at 13:47
  • @WashingtonGuedes that's not actually what I'm trying to do. It's an example. What I actually have is something of the form `name>>things(d),separated(d1),by(d2),commas(d3)::somethingAtEnd` - I can match this with one expression. I want the comma separated things with the thing in the brackets after them. It isn't difficult to write a rule that does this (something like `[a-z]+\([a-z]+\)(,[a-z]+\([a-z]+\))*` will work) and as you can see I'm capturing the chunks on the way. What I want is to get those chunks from one expression – Alec Teal Oct 15 '15 at 13:48
  • @iismathwizard see comment above, should I add that to question? Does it make more sense now? – Alec Teal Oct 15 '15 at 13:51
  • Alec Teal, just update your post with your specific problem that you posted in the comment. I think you would see answers – james jelo4kul Oct 15 '15 at 13:51
  • @AlecTeal I think i was wrong in suggesting perl could do this. I think .NET is the only one that actually has a way to do it as it preserves the backtracking information for capture groups in this way. – d0nut Oct 15 '15 at 13:51
  • I got your point, but I'd use `preg_match_all` with `((?<!^)\w)` –  Oct 15 '15 at 13:52
  • @WashingtonGuedes this isn't helpful either. – d0nut Oct 15 '15 at 13:53
  • @iismathwizard done, is it clear? – Alec Teal Oct 15 '15 at 13:57
  • @AlecTeal yes, it's clear. It's been clear this whole time. I'm not sure why you felt you had to explain it further but alright. In the end, I feel like your fallback is the only way to do this, unfortunately. – d0nut Oct 15 '15 at 13:58
  • Just to defend perl (a language I don't know but respect for the witchcraft that it does) http://sunsite.ualberta.ca/Documentation/Misc/perl-5.6.1/pod/perlretut.html#matching%20repetitions @iismathwizard - PCRE does seem to support this (you have to scroll down quite far, to almost where it says "search and replace") – Alec Teal Oct 15 '15 at 14:03
  • Well, I've given it 40 minutes. Thanks @iismathwizard, and to that other guy for the Black-Adder like solutions (when Edmund doesn't want people to know he's in love with Bob, his manservant who is actually a girl, the wise lady has 3 suggestions: "Kill Bob", "Kill yourself" and "Kill everybody in the whole world" - that dot net thing ....) – Alec Teal Oct 15 '15 at 14:11
  • @AlecTeal ahh nice find. I'm not a perl developer so it's definitely not within my realm of knowledge ^^ – d0nut Oct 15 '15 at 14:51
  • @iismathwizard you know in mathematics where you have a nice set of operations, like a group, or a great set of base classes for a stunning object orientated system. Regex is kind of like that. But then Perl got to it and they added stuff, magical stuff, that ruined the DFA model, sort of and ruined the purity of regex, but in so doing created something pure in a different way, a dirty way. They did terrible things, terrible, but great things. If something ever seems sort of dirty but pure, Perl can do it. – Alec Teal Oct 15 '15 at 14:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/92402/discussion-between-alec-teal-and-iismathwizard). – Alec Teal Oct 15 '15 at 15:03

2 Answers

So as discussed in the comments (and to stop people posting rules that match the text (SERIOUSLY, read the Q)) I shall post the "solution" here.

I use this rule:

^([a-z]+)>>(.*)::([a-z]+)$

(Or something to that effect)

Then I can use preg_match_all on the middle capture and extract the data that way. Annoyingly this doesn't check for commas. But I can scrap that requirement.

So something like:

 preg_match_all('/([a-z]+)\(([a-z]+)\)/',...

On that.
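
A minimal sketch of how the two steps could fit together, assuming the sample tag from the question. The digit-friendly character classes, the `[a-zA-Z]` class for the trailing name, and the variable names are assumptions added for illustration; they are not part of the rules above.

```php
<?php
$tag = 'name>>thing1(d1),thing2(d2),thing3(d3)::otherName';

// Step 1: split the tag into the name, the middle chunk, and the trailing name.
// Assumption: [a-zA-Z] for the last part, so that "otherName" matches.
if (preg_match('/^([a-z]+)>>(.*)::([a-zA-Z]+)$/', $tag, $outer) !== 1) {
    throw new InvalidArgumentException('tag does not match');
}

// Step 2: pull every thing(data) pair out of the middle chunk.
// Assumption: digits are allowed, because the sample uses thing1, d1, and so on.
preg_match_all('/([a-z0-9]+)\(([a-z0-9]+)\)/', $outer[2], $pairs, PREG_SET_ORDER);

$things = [];
foreach ($pairs as $pair) {
    $things[] = [$pair[1], $pair[2]];
}

// name, list of [thing, data] pairs, otherName
$result = [$outer[1], $things, $outer[3]];
print_r($result);
```

As noted, the second step does not verify that the pairs are separated by commas or that they cover the whole middle chunk; that check would have to be added separately if it matters.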

Alec Teal
  • Someone downvoted this! I'm going to upvote it back to 0 at least. – Chris Lear Oct 15 '15 at 14:28
  • @anubhava in case it isn't clear enough __I have no problems writing rules that match the tag__ the problem is capturing (as in getting the results in an array) a capture group that is repeated. – Alec Teal Oct 15 '15 at 14:37
  • @anubhava if you do `/^a(.)+$/` on the string `abc` the first capture is reported as `c` (the last thing it matched) - I want an array with 2 elements, `b` and `c`, for this. Which cannot be done. – Alec Teal Oct 15 '15 at 14:44
  • `preg_match_all('/(^a|\G)(.)/', 'abc', $matches); print_r($matches);` gives `b` and `c` as 2 separate matches. – anubhava Oct 15 '15 at 14:47
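
The `\G` idea from the last comment can be sketched against the tag format as well. The second rule below is an assumption built for illustration (it is not from the thread): the first match consumes `name>>` plus the first pair, and each later match has to start exactly where the previous one ended and begin with a comma.

```php
<?php
// The example from the comment: two matches, with "b" and "c" in group 2.
preg_match_all('/(^a|\G)(.)/', 'abc', $simple);

// Hypothetical extension of the same idea to the tag format.
// Assumption: digits are allowed in the character classes.
$tag  = 'name>>thing1(d1),thing2(d2),thing3(d3)::otherName';
$rule = '/(?:^([a-z]+)>>|\G,)([a-z0-9]+)\(([a-z0-9]+)\)/';
preg_match_all($rule, $tag, $sets, PREG_SET_ORDER);

$things = [];
foreach ($sets as $set) {
    // Group 1 (the name) is only filled in for the first match.
    $things[] = [$set[2], $set[3]];
}

print_r($simple[2]);
print_r($things);
```

This yields one match per pair and does check the commas, but it still does not capture or validate the `::otherName` part, so it is not the single nested array asked for in the question.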

Maybe I'm missing something... can't you use something like this:

/(?:(.*)>>)|(?:(thing.*?\)),?)|(?:::(.*))/g
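
For reference, a rough sketch of how this pattern could be run in PHP. PHP's PCRE functions have no `g` modifier, so `preg_match_all` does the global search instead; the loop that flattens the result and the variable names are assumptions added for illustration.

```php
<?php
$tag  = 'name>>thing1(d1),thing2(d2),thing3(d3)::otherName';
// Same pattern as above, without the /g flag.
$rule = '/(?:(.*)>>)|(?:(thing.*?\)),?)|(?:::(.*))/';
preg_match_all($rule, $tag, $sets, PREG_SET_ORDER);

$tokens = [];
foreach ($sets as $set) {
    // Only one of groups 1..3 participates in each match; take that one.
    for ($i = 1; $i <= 3; $i++) {
        if (($set[$i] ?? '') !== '') {
            $tokens[] = $set[$i];
            break;
        }
    }
}

print_r($tokens);
```

The result is a flat list of five strings, with each thing still glued to its data, for example `thing1(d1)`.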

Chris Lear
  • Yes you are. How does this get an array of the things with their data? This just matches the entire thing. – Alec Teal Oct 15 '15 at 14:16
  • https://regex101.com/r/qQ2dE4/75 seems to think it works. Well, it does what I thought you were asking for, which is to yield these matches: `name thing1(d1) thing2(d2) thing3(d3) otherName` – Chris Lear Oct 15 '15 at 14:17
  • Because you missed the comments, I've added to the question, please ignore the other guy who answered. – Alec Teal Oct 15 '15 at 14:20
  • I see what I actually missed, which is that you want thing1 and d1 to match separately – Chris Lear Oct 15 '15 at 14:21
  • Once again I can match the tag, that isn't the problem here. I want to get an array of matches that's populated with the data matched from a repetition. – Alec Teal Oct 15 '15 at 14:21
  • Yes, I see. The best I can do is `/(?:(.*)>>)|(?:(thing.*?)\((.*?)\),?)|(?:::(.*))/g`, which returns the right matches, but not in the right sort of array. – Chris Lear Oct 15 '15 at 14:23
  • That is why the question was asked, because it isn't the right sort of array. – Alec Teal Oct 15 '15 at 14:23
  • I'm out of ideas. Maybe switch to .NET :) – Chris Lear Oct 15 '15 at 14:25