I'm trying to use a regular expression to capture substrings delimited by another substring. For example, if I had the sentence

My cat is a cat.

and the delimiter I wanted to use was "cat", the output should be

My

is a

.

I've been unable to find a solution where the delimiter isn't a single character.

Edit: I'm writing this in Java, and the output represents groups returned by Java's Matcher class in a call like "myMatcher.group()". Sorry for the confusion.

  • Where have you been searching and what were the search keywords? I mean: language? tool? What you tried? – Wiktor Stribiżew Jun 23 '16 at 07:14
  • Tushar's approach [works like this](https://regex101.com/r/lF9pV0/1). – Wiktor Stribiżew Jun 23 '16 at 07:21
  • @WiktorStribiżew I'm writing this in Java, and I was unclear with the output. I meant that each of those lines should be a group if one were using the Java Matcher class (i.e. myMatcher.group() ). – Grandfather-Paradox Jun 23 '16 at 07:55
  • @Grandfather-Paradox: Why do you want to use a Matcher if you need to *split* a string? Sorry, your question is way too unclear. – Wiktor Stribiżew Jun 23 '16 at 07:56
  • Please add your code. – Wiktor Stribiżew Jun 23 '16 at 07:58
  • @WiktorStribiżew Each substring should be returned as a match to the pattern I'm trying to write. If I'm understanding correctly, there are three cases: the substring has "cat" on both of its boundaries, it has the left end of the string (^) on one boundary and "cat" on the other, and it has "cat" on one boundary and the right side of the string on the other ($). The substring shouldn't include the word "cat" either; cat should be the delimiter. I know this can be accomplished with StringTokenizer, but I'm more interested in the regex solution. – Grandfather-Paradox Jun 23 '16 at 08:07
  • Here is the correct approach - https://ideone.com/DnegWt – Wiktor Stribiżew Jun 23 '16 at 08:08
  • @WiktorStribiżew I was unclear that I know how to use String's .split() method as well as StringTokenizer. I'm really just interested in how a regex would accomplish what I'm trying to do. – Grandfather-Paradox Jun 23 '16 at 08:09
  • @WiktorStribiżew So, somewhat more simply, it could be represented with this pseudo-regex as: (^ or cat)(substring that does not include "cat")(cat or $). – Grandfather-Paradox Jun 23 '16 at 08:11
  • You should know that what you mean is not practical, and this makes no sense since you are using Java. Here is what I think you want: http://ideone.com/748GeU. – Wiktor Stribiżew Jun 23 '16 at 08:14
  • Thank you. That is what I was looking for. – Grandfather-Paradox Jun 23 '16 at 08:19

1 Answer

What you need is String#split as Tushar pointed out in the comment.

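import java.util.Arrays;

String s = "My cat is a cat.";
String[] res = s.split("cat");             // the argument is a regex; "cat" has no metacharacters
System.out.println(Arrays.toString(res));  // prints [My ,  is a , .]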

For splitting a string on a delimiter, this is the right tool in Java.
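Since the argument to split is a regex, a delimiter that contains metacharacters needs quoting. Here is a minimal sketch of that, assuming you want the delimiter treated as literal text; the class name SplitOnLiteral and the helper splitOn are made up for illustration, while Pattern.quote and String#split are the standard library calls.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitOnLiteral {
    // Hypothetical helper: splits input on a literal delimiter of any length.
    static String[] splitOn(String input, String delimiter) {
        // Pattern.quote makes any regex metacharacters in the delimiter literal.
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Multi-character delimiter without metacharacters.
        System.out.println(Arrays.toString(splitOn("My cat is a cat.", "cat")));
        // Delimiter made of metacharacters, matched literally thanks to quoting.
        System.out.println(Arrays.toString(splitOn("a.*b.*c", ".*")));
    }
}
```

Without the quoting, the second call would treat ".*" as "any text" rather than as the two literal characters.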

Now, you want to know how to match any text other than cat with the Matcher.

DISCLAIMER: do not use this in Java; it is highly impractical and inefficient.

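You may match cat and capture it into Group 1, then add a second alternative to the pattern that matches any text other than cat. When Group 1 is null, the match came from the second alternative, so it is one of the substrings you want.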

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "My cat is a cat.";
Pattern pattern = Pattern.compile("(?i)(cat)|[^c]*(?:c(?!at)[^c]*)*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (matcher.group(1) == null) {               // Group 1 did not participate, so this is NOT "cat"
        if (!matcher.group(0).isEmpty()) {        // Is the match text NOT empty?
            System.out.println(matcher.group(0)); // Great, print it
        }
    }
}

See the IDEONE demo

Pattern details:

  • (?i) - case insensitive inline modifier
  • (cat) - Group 1 capturing a substring cat
  • | - or
  • [^c]*(?:c(?!at)[^c]*)* - a substring that does not contain cat. It is an unrolled version of the tempered greedy token (?s)(?:(?!cat).)*.
    • [^c]* - 0+ chars other than c or C
    • (?:c(?!at)[^c]*)* - zero or more sequences of:
      • c(?!at) - c or C not followed by at, At, AT, or aT
      • [^c]* - 0+ chars other than c or C
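For comparison, here is a rough sketch of the same loop using the plain (not unrolled) tempered greedy token mentioned above. It is easier to read but checks the lookahead at every character, which is why the unrolled form is usually preferred. The class name TemperedTokenDemo and the method piecesBetween are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    // Hypothetical helper: collects every non-empty run of text that contains
    // no "cat" (case-insensitive).
    static List<String> piecesBetween(String input) {
        // (cat) captures the delimiter; the second alternative is the tempered
        // greedy token: any char, as long as "cat" does not start at it.
        Pattern p = Pattern.compile("(?is)(cat)|(?:(?!cat).)*");
        Matcher m = p.matcher(input);
        List<String> pieces = new ArrayList<>();
        while (m.find()) {
            // Group 1 is null when the second alternative matched.
            if (m.group(1) == null && !m.group().isEmpty()) {
                pieces.add(m.group());
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(piecesBetween("My cat is a cat.")); // prints [My ,  is a , .]
    }
}
```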