Given input as:

#start
  random string 1
#end

#start
  random string 2
#end

I can write a regex like this:

(#start[\s\S]*?#end)
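
For reference, a minimal Java sketch of how that lazy pattern behaves on the flat input. The class name `FlatBlocks` and the helper `findBlocks` are made up for illustration; only `java.util.regex` is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlatBlocks {
    // Lazy quantifier: each match stops at the first #end after its #start.
    private static final Pattern BLOCK = Pattern.compile("#start[\\s\\S]*?#end");

    // Hypothetical helper: collects every non-overlapping match in order.
    static List<String> findBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "#start\n  random string 1\n#end\n\n#start\n  random string 2\n#end\n";
        for (String block : findBlocks(input)) {
            System.out.println(block);
            System.out.println("--");
        }
    }
}
```

This only works while blocks are not nested, because the lazy match always ends at the nearest #end.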

Now things get a bit more complex with this data:

  #start
    random string 1
    #start
      random string 2
    #end
  #end

  #start
    random string 3
  #end

and I want to get 3 matches, which are:

#start
  random string 1
#end

#start
  random string 2
#end

#start
  random string 3
#end

Is this even possible with regex? I tried most of the regex rules, but I think I missed something, because it doesn't work the way I want.

Can someone show me which rules I can use to achieve this?

Thank you.

Xitrum
  • No way to do it with a single regex. `#start random string 1 #end` is missing in the string as a continuous streak of text. – Wiktor Stribiżew Jun 27 '17 at 19:54
  • Perhaps yes, perhaps no; it depends on whether you gave us the correct indentation. But in any case the first result will contain the second. Edit your question to be clearer about that. If in real life the string isn't indented, it's not possible. – Casimir et Hippolyte Jun 27 '17 at 19:56
  • @WiktorStribiżew considering that the nesting depth is unknown, to me this seems like a problem that can't be solved with regex alone. – Xitrum Jun 27 '17 at 19:59
  • @CasimiretHippolyte the indentation is not guaranteed. – Xitrum Jun 27 '17 at 19:59
  • @Xitrum: in this case, it isn't possible. Use a more classic way with loops and stacks, flags... – Casimir et Hippolyte Jun 27 '17 at 20:02

3 Answers

You cannot do it with a single regex. However, you can achieve it by extracting one group at a time and removing it from the input string in a loop until no more matches can be found.

So the regex might look like the following in Java:

Pattern p = Pattern.compile("^.*(#start[^#]+#end).*$");

Now you can remove the matched portion from the input string and repeat this in a loop.

Here is a small test program which does it:

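import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractBlocks {
    public static void main(String[] args) {
        String re = "#start hello there #start my world #end #end #start bye dear #end ";
        // Greedy (.*) selects the last block; [^#]+ stops it from spanning another marker.
        Pattern p = Pattern.compile("^(.*)(#start[^#]+#end)(.*)$");
        Matcher m;
        // Each pass prints one innermost block, then cuts it out of the string.
        while ((m = p.matcher(re)).matches()) {
            System.out.println(m.group(2));
            re = m.group(1) + m.group(3);
        }
    }
}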

and the result is:

#start bye dear #end
#start my world #end
#start hello there  #end
Serge
  • BTW, here is a disclaimer: in the question you have #start..#end on separate lines. In Java I would avoid using regex as in the example, because it has performance implications. I would rather do line-by-line processing and build a stack of interpreted data chunks. – Serge Jun 28 '17 at 12:07
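
The line-by-line approach from the comment above could be sketched roughly as follows. This is only an illustration, assuming each #start and #end marker sits on its own line as in the question; the class name `StackBlocks` and the method `parse` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackBlocks {
    // Returns the own content lines of every block, in the order the blocks open.
    static List<List<String>> parse(String input) {
        List<List<String>> blocks = new ArrayList<>();
        Deque<List<String>> open = new ArrayDeque<>(); // stack of blocks not yet closed
        for (String raw : input.split("\\R")) {
            String line = raw.trim();
            if (line.equals("#start")) {
                List<String> block = new ArrayList<>();
                blocks.add(block);
                open.push(block);
            } else if (line.equals("#end")) {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("#end without matching #start");
                }
                open.pop();
            } else if (!line.isEmpty() && !open.isEmpty()) {
                // Content belongs to the innermost block that is still open.
                open.peek().add(line);
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("#start without matching #end");
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "#start", "  random string 1", "  #start", "    random string 2", "  #end", "#end",
            "", "#start", "  random string 3", "#end");
        for (List<String> block : parse(input)) {
            System.out.println("#start " + String.join(" ", block) + " #end");
        }
    }
}
```

Because each block is registered when it opens, the three blocks come out in the order 1, 2, 3 regardless of nesting depth, and indentation is ignored.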

This cannot be done with regex alone. The answer to "Can regular expressions be used to match nested patterns?" explains in detail why this is the case. You must encode the maximum possible depth within your regex to make it work.
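
As one possible reading of "encode the maximum possible depth", here is a hedged sketch that builds a Java pattern by nesting the block pattern inside itself a fixed number of times. The names `DepthLimited`, `forMaxDepth` and `findAll` are invented for this example. Note that it matches whole outer blocks, nested content included, so on its own it does not produce the three separate strings asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DepthLimited {
    // Builds a pattern matching balanced #start...#end blocks nested at most maxDepth deep.
    static Pattern forMaxDepth(int maxDepth) {
        // One character that does not begin a marker.
        String plain = "(?:(?!#start|#end)[\\s\\S])";
        // Depth 1: a block with no nested block inside.
        String block = "#start" + plain + "*#end";
        // Each extra level allows the previous level's block to appear as content.
        for (int d = 2; d <= maxDepth; d++) {
            block = "#start(?:" + plain + "|" + block + ")*#end";
        }
        return Pattern.compile(block);
    }

    // Hypothetical helper: collects every non-overlapping match in order.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "#start a #start b #end #end #start c #end";
        for (String s : findAll(forMaxDepth(2), input)) {
            System.out.println(s);
        }
    }
}
```

If the real nesting is deeper than the encoded limit, the outermost block no longer matches and the pattern falls back to matching inner blocks, which is why the limit has to be known in advance.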

ngreen
  • Note that the question you linked is very general and doesn't take account of what modern regex engines are able to do. There are many languages with a regex engine able to do that *(but not Java)*: Ruby/Perl/.net languages/PHP/R/Python with the regex module... – Casimir et Hippolyte Jun 27 '17 at 20:06
  • Well this is a Java question. – ngreen Jun 27 '17 at 20:12
  • Not the one you linked. – Casimir et Hippolyte Jun 27 '17 at 20:13
  • But I was answering *this* question, and the answer I linked to applies. – ngreen Jun 27 '17 at 20:14
  • If you know about regex, then answer this question. Don't make a blanket statement such as `This cannot be done with regex alone.` because it can, and is done every day !! Also, it's more informative to show a sample of what you mean when you say encode the max depth. –  Jun 27 '17 at 22:08

I got the solution from the idea in Serge's answer. That answer is good, but it didn't fit my case because the nesting depth is unknown. So my solution finds the deepest matched groups, removes them from the string, and then continues on the remaining string.

So something like (#start((?!#start)[\s\S])*?#end)
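
A rough Java sketch of that loop, in case it helps someone. The class name `InnermostFirst` and the method `extract` are just illustrative names, and the pattern is the one above with a non-capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostFirst {
    // (?!#start) forbids another #start inside the match, so only innermost blocks match.
    private static final Pattern INNERMOST =
        Pattern.compile("#start(?:(?!#start)[\\s\\S])*?#end");

    // Collects innermost blocks, removes them, and repeats until nothing matches.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        String rest = input;
        while (true) {
            Matcher m = INNERMOST.matcher(rest);
            StringBuffer remaining = new StringBuffer();
            boolean matched = false;
            while (m.find()) {
                matched = true;
                found.add(m.group());
                m.appendReplacement(remaining, ""); // drop the matched block
            }
            if (!matched) {
                return found;
            }
            m.appendTail(remaining);
            rest = remaining.toString();
        }
    }

    public static void main(String[] args) {
        String input = "#start\n  random string 1\n  #start\n    random string 2\n  #end\n#end\n\n"
            + "#start\n  random string 3\n#end\n";
        for (String block : extract(input)) {
            System.out.println(block);
            System.out.println("--");
        }
    }
}
```

The blocks come out deepest first within each pass, so for the sample data the order is 2, 3, 1 rather than 1, 2, 3, and whitespace left behind by a removed inner block stays in the outer one.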

Xitrum