Lately I have been playing around with regex in Java, and I ran into a problem that is (theoretically) easy to solve, but I was wondering if there is an easier way to do it (yes, yes, I am lazy). The problem is capturing a group multiple times, that is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroup {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("A (IvI(.*?)IvI)*? A");
        Matcher m = p.matcher("A IvI asd IvI IvI qwe IvI A"); // ANY NUMBER of "IvI x IvI" sections
        //Matcher m = p.matcher("A  A");
        int loi = 0; // last occurrence index
        String storage;
        // Attempt 1: after each match, restart the search at the end of group 1
        while (loi >= 0 && m.find(loi)) {
            System.out.println(m.group(1));
            if ((storage = m.group(2)) != null) {
                System.out.println(storage);
            }
            loi = m.end(1); // -1 if group 1 did not take part in the match
        }
        m.find(); // match again so that m.group(1) is available below
        System.out.println("2 opt");
        // Attempt 2: run a second regex over the text captured by group 1
        Pattern p2 = Pattern.compile("IvI(.*?)IvI");
        Matcher m2 = p2.matcher(m.group(1)); // m.group(1) = "IvI asd IvI IvI qwe IvI"
        loi = 0;
        while (loi >= 0 && m2.find(loi)) {
            if ((storage = m2.group(1)) != null) {
                System.out.println(storage);
            }
            loi = m2.end(0);
        }
    }
}

Using ONLY Pattern p, is there any way to get what is inside the IvI's (in the test string that would be "asd" and "qwe"), considering that there could be any number of IvI sections? I am after something like what I am trying to do in the first while loop: find the first occurrence of the group, then move the index and search for the next group, and so on.

With the code in that while loop, group 2 comes back as asd IvI IvI qwe, not asd and then qwe. I suppose part of the reason could be the (.*?) part: it is not supposed to be greedy, yet it still runs up to qwe, consuming two of the IvI's. I mention this because otherwise I might be able to use the end index of those groups with the matcher.find(anInt) method, but that does not work either. I don't think there is anything wrong with the regex, since the following code works without consuming the IvI.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyPrefix {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(.*?)IvI");
        Matcher m = p.matcher("bla bla blaIvI");
        m.find();
        System.out.println(m.group(1));
    }
}

This prints: bla bla bla

THERE IS A SOLUTION, I KNOW (but I am lazy, remember)

(It is also in the first code block, below the "2 opt" message.) The solution is to divide the match into sub-groups and use another regex that processes those sub-groups one at a time...

BTW, I did my homework. This page mentions:

Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched.
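
The overwrite behaviour described in that quote can be checked with a minimal sketch in java.util.regex. The pattern used here is ([A-Z]_)+, so each repetition consumes one letter plus its underscore; the class name LastCaptureDemo and the helper lastCapture are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Returns what group 1 holds after the first match of ([A-Z]_)+,
    // or null if the input does not match at all.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("([A-Z]_)+").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group matched A_, B_, C_ and D_ in turn; only the last value survives.
        System.out.println(lastCapture("A_B_C_D_")); // prints D_
    }
}
```

The Matcher API exposes a single start/end pair per group, which is why the earlier iterations cannot be read back after the match.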

But I am still hoping you can give me good news...

Ordiel
  • What is your expected result in this case: `A IvI a IvI IvI IvI b IvI A` and this case `A IvI a IvI IvI b IvI A IvI a IvI IvI b IvI A`? Note that 2 step matching with 2nd step regex `IvI(.*?)IvI` doesn't work for the first case. In the second case, it is one of the test cases I used when building a regex using this method: http://stackoverflow.com/questions/15268504/collapse-and-capture-a-repeating-pattern-in-a-single-regex-expression/15418942#15418942 – nhahtdh Nov 06 '14 at 09:09
  • For the first case I would get a, then a " " [space], and then it would not be able to find another pair of IvI, since the string does not match the pattern after b. For the second I would get a, then b, and since the second A matches the pattern it would stop there – Ordiel Nov 06 '14 at 16:55
  • I'm not asking about your code. I'm asking about the result you want if those cases happen. – nhahtdh Nov 06 '14 at 16:59
  • I am expecting to find everything that is between two IvI's, without sharing them. For example, in the first case you mention I would get a and ' '; the last b does not have an IvI predecessor, therefore it does not match. In the second it would be a, b. I would also like to get the other a, b, but the pattern stops matching at the A in the middle. That would require adding something to my regex to make it repeat, but once again the A in the middle closes the pattern, so an A would be missing after it to start again – Ordiel Nov 06 '14 at 17:05
  • I suggest you talk code and not prose. Please update your question with an illustrative unit test showing your expectations for all important regular and corner cases. This way somebody could provide you with code satisfying the test. – kriegaex Nov 10 '14 at 13:45

1 Answer

No, unfortunately, as your citation already mentions, the java.util.regex regular expression implementation does not support retrieving any previous values of a repeated capturing group after a single match. The only way to get those, as your code illustrates, is by find()ing multiple matches of the repeated part of your regular expression.
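
As a rough sketch of that find() approach (not a single-match solution), the loop below applies only the repeated part, IvI(.*?)IvI, to the whole input and collects group 1 from every match. The class name SectionExtractor, the method extractSections, and the trim() call are illustrative choices that are not in the question, and the sketch assumes the surrounding "A ... A" frame does not need to be validated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {
    // Matches one "IvI ... IvI" section; group 1 is the text between the delimiters.
    private static final Pattern SECTION = Pattern.compile("IvI(.*?)IvI");

    // Collects the (trimmed) content of every section, in order of appearance.
    static List<String> extractSections(String input) {
        List<String> sections = new ArrayList<>();
        Matcher m = SECTION.matcher(input);
        // Each find() resumes after the previous match, so delimiters are never shared.
        while (m.find()) {
            sections.add(m.group(1).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        System.out.println(extractSections("A IvI asd IvI IvI qwe IvI A")); // prints [asd, qwe]
    }
}
```

If the outer frame matters, one option is to match it first and run this loop only over the captured middle part, which is essentially the "2 opt" branch of the question's code.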

I've also been looking at other implementations of regular expressions in Java, but I could not find any that supported it (only the Microsoft .NET engine does). If I understood correctly, implementations of regular expressions based on state machines cannot easily implement this feature. java.util.regex does not use state machines, though.

If anyone knows of a Java regular expression library that supports this behaviour, please share it, because it would be a powerful feature.

P.S. It took me quite a while to understand your question. The title is good, but the body left me unsure whether I had understood you correctly.

Barry NL