39

This question has been bugging me for a long time now but essentially I'm looking for the most efficient way to grab all Strings between two Strings.

The way I have been doing it for many months now is by using a bunch of temporary indices, strings, and substrings, and it's really messy. (Why does Java not have a native method such as String substring(String start, String end)?)

Say I have a String:

abcabc [pattern1]foo[pattern2] abcdefg [pattern1]bar[pattern2] morestuff

The end goal would be to output foo and bar. (And later to be added into a JList)

I've been trying to incorporate regex in .split() but haven't been successful. I've tried syntax using *'s and .'s but I don't think it's quite what my intention is especially since .split() only takes one argument to split against.

Otherwise I think another way is to use the Pattern and Matcher classes? But I'm really fuzzy on the appropriate procedure.

nhahtdh
Justin
  • You definitely want to use a `Matcher` for this. – Amber Jun 29 '12 at 02:40
  • @Amber "definitely"?? That's pretty strong language considering what's possible in code. See my one-liner answer (that *doesn't* use a `matcher`!) – Bohemian Jun 29 '12 at 02:48
  • @Bohemian And see my *comment* on your answer. Just because you *can* use something doesn't mean you *should*. – Amber Jun 29 '12 at 03:00

3 Answers

94

You can construct the regex to do this for you:

// pattern1 and pattern2 are String objects
String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);

This treats pattern1 and pattern2 as literal text, and the text in between is captured in the first capturing group. You can remove Pattern.quote() if you want the delimiters themselves to be interpreted as regex, but I don't guarantee anything if you do that.

You can customize how the match should occur by adding flags to the regexString.

  • If you want Unicode-aware case-insensitive matching, add (?iu) at the beginning of regexString, or pass the Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE flags to the Pattern.compile method.
  • If you want to capture the content even when the two delimiting strings are on different lines, add (?s) before (.*?), i.e. "(?s)(.*?)", or pass the Pattern.DOTALL flag to the Pattern.compile method.
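To make the effect of these flags concrete, here is a minimal sketch. The class name FlagsDemo, the helper firstBetween and the sample text are made up for illustration; only Pattern.quote, Pattern.compile and the flag constants come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsDemo {
    // Hypothetical helper: returns the first text between start and end, or null if there is none.
    static String firstBetween(String text, String start, String end, int flags) {
        String regex = Pattern.quote(start) + "(.*?)" + Pattern.quote(end);
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The opening delimiter differs in case, and the content spans two lines.
        String text = "<B>line one\nline two</b>";

        // No flags: "<b>" does not match "<B>", so nothing is found.
        System.out.println(firstBetween(text, "<b>", "</b>", 0));

        // Case-insensitive matching plus DOTALL, so '.' also matches the line break.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL;
        System.out.println(firstBetween(text, "<b>", "</b>", flags));
    }
}
```

Passing the flags to Pattern.compile should behave the same as prefixing the regex with (?ius); which form reads better is mostly a matter of taste.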

Then compile the regex, obtain a Matcher object, iterate through the matches and save them into a List (or any Collection, it's up to you).

Pattern pattern = Pattern.compile(regexString);
// text contains the full text that you want to extract data from
Matcher matcher = pattern.matcher(text);

while (matcher.find()) {
  String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
  // You can insert match into a List/Collection here
}
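Put together, the steps above could be wrapped in a small helper along these lines. This is only a sketch: the class name Between and the method substringsBetween are hypothetical names, and the sample string is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Between {
    // Hypothetical helper: collects every substring found between start and end, in order.
    static List<String> substringsBetween(String text, String start, String end) {
        // Both delimiters are quoted, so they are matched as literal text.
        Pattern pattern = Pattern.compile(Pattern.quote(start) + "(.*?)" + Pattern.quote(end));
        Matcher matcher = pattern.matcher(text);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1)); // group 1 is the (.*?) part
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "abcabc [pattern1]foo[pattern2] abcdefg [pattern1]bar[pattern2] morestuff";
        System.out.println(substringsBetween(text, "[pattern1]", "[pattern2]")); // prints [foo, bar]
    }
}
```

The returned list can then be copied into whatever backs the JList, for example a DefaultListModel.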

Testing code:

String pattern1 = "hgb";
String pattern2 = "|";
String text = "sdfjsdkhfkjsdf hgb sdjfkhsdkfsdf |sdfjksdhfjksd sdf sdkjfhsdkf | sdkjfh hgb sdkjfdshfks|";

Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
Matcher m = p.matcher(text);
while (m.find()) {
  System.out.println(m.group(1));
}

Do note that if you search for the text between foo and bar in the input "foo text foo text bar text bar" with the method above, you will get exactly one match, which is " text foo text " (the match starts at the first foo and ends at the first bar after it).
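The caveat can be reproduced with a short sketch. The second method is not part of the answer: it is one possible variant, using a negative lookahead so the captured text may not contain the start delimiter, in case the innermost pair is what you want. The names InnermostDemo, lazyMatches and innermostMatches are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnermostDemo {
    // Shared loop: collect group 1 of every match.
    private static List<String> collect(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // The approach from the answer: lazy match between the quoted delimiters.
    static List<String> lazyMatches(String text, String start, String end) {
        return collect(Pattern.quote(start) + "(.*?)" + Pattern.quote(end), text);
    }

    // Assumed variant: each consumed character must not begin another start delimiter.
    static List<String> innermostMatches(String text, String start, String end) {
        String s = Pattern.quote(start);
        return collect(s + "((?:(?!" + s + ").)*?)" + Pattern.quote(end), text);
    }

    public static void main(String[] args) {
        String text = "foo text foo text bar text bar";
        System.out.println(lazyMatches(text, "foo", "bar"));      // one match: " text foo text "
        System.out.println(innermostMatches(text, "foo", "bar")); // one match: " text "
    }
}
```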

nhahtdh
  • Thank you! :) This works great! Just one thing, the part `String textInBetween = m.group(1); // Since (.*?) is capturing group 1 ` should probably be `matcher.group(1)` but that's a minor typo and the testing code works brilliantly! – Justin Jun 29 '12 at 03:32
  • @Justin: Thanks for spotting the typo. I copy and paste, but failed to edit everything. – nhahtdh Jun 29 '12 at 03:43
  • This doesn't work when new line character is between our starting and ending words. – Michał Tajchert Oct 12 '15 at 08:29
  • @Tajchert: Just change the part `(.*?)` to `(?s)(.*?)`, or add `Pattern.DOTALL` flag into `Pattern.compile`. – nhahtdh Oct 13 '15 at 02:16
  • Beautiful clean solution - thank you all for solving this problem for us all. To all, please make sure you add Pattern.DOTALL so you can capture multiline text between your patterns. – Prashanth Jun 28 '19 at 15:04
  • @Prashanth Thanks a lot! I was trying out the code but it wasn't working for multi-line. Your flag helped me! – Rohit Kumar Aug 11 '19 at 14:31
13

Here's a one-liner that does it all:

List<String> strings = Arrays.asList( input.replaceAll("^.*?pattern1", "")
    .split("pattern2.*?(pattern1|$)"));

The breakdown is:

  1. Remove everything up to pattern1 (required to not end up with an empty string as the first term)
  2. Split on input (non-greedy .*?) between pattern2 and pattern1 (or end of input)
  3. Use the utility method Arrays.asList() to generate a List<String>

Here's some test code:

import java.util.Arrays;
import java.util.List;

public class OneLinerTest {
    public static void main(String[] args) {
        String input = "abcabc pattern1foopattern2 abcdefg pattern1barpattern2 morestuff";
        List<String> strings = Arrays.asList(input.replaceAll("^.*?pattern1", "").split("pattern2.*?(pattern1|$)"));
        System.out.println(strings);
    }
}

Output:

[foo, bar]
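For delimiters that contain regex metacharacters (such as the [pattern1] and [pattern2] in the question), the same idea could be sketched with Pattern.quote and the two steps written out separately. The class name SplitBetween and the method between are hypothetical; quoting the delimiters is an addition to the one-liner above, not part of it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitBetween {
    // Hypothetical helper: the replaceAll + split approach with literal (quoted) delimiters.
    static List<String> between(String input, String start, String end) {
        String s = Pattern.quote(start);
        String e = Pattern.quote(end);
        // Step 1: drop everything up to and including the first start delimiter.
        String trimmed = input.replaceAll("^.*?" + s, "");
        // Step 2: split on "end ... next start", or on "end ... end of input".
        String[] parts = trimmed.split(e + ".*?(" + s + "|$)");
        // Step 3: wrap the array as a List<String>.
        return Arrays.asList(parts);
    }

    public static void main(String[] args) {
        String input = "abcabc [pattern1]foo[pattern2] abcdefg [pattern1]bar[pattern2] morestuff";
        System.out.println(between(input, "[pattern1]", "[pattern2]")); // prints [foo, bar]
    }
}
```

One thing to watch for with this approach: if the input contains neither delimiter, nothing is removed or split, so the result is a one-element list holding the whole input rather than an empty list.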
Bohemian
  • And is also extremely difficult to follow. I would not want to see this in code I had to maintain. – Amber Jun 29 '12 at 02:42
  • Really? I coded this in a jiffy. Anyway, just add my explanation as in-code comments and everybody would be happy – Bohemian Jun 29 '12 at 03:03
  • Or you could do it using a Matcher, not have to use comments to explain what's going on, and better support potential future changes to the requirements - for instance, your solution breaks down if it becomes desirable to match, say, multiple different pairs of start/end markers. Using a matcher also doesn't require constructing an intermediate string, which could have a significant performance impact if the strings being operated on are large. – Amber Jun 29 '12 at 03:05
  • Spiffy one-liner! I'm going to try to get more rep and come back to mark this as useful! – Justin Jun 29 '12 at 03:33
  • @Justin Thanks. I live by the mantra "less code is good" (with the caveat that it remains readable) - it keeps the "signal to noise ratio" of your code high. I would happily use this code in production. It's easy to understand if you know regex well, and it uses the API to do all the heavy lifting for you. I can't understand why people voted the other answer over this one - it has heaps of code and does nothing more than this one line, and having heaps of code makes it ***less*** readable! – Bohemian Jun 29 '12 at 03:58
  • @Bohemian, do you have some solution when pattern1 and pattern2 matching should be done with ignore case? – ddmytrenko Jan 14 '14 at 12:55
  • @ddmytrenko sure, just add the "ignore case" flag `(?i)` to the regexes: `List<String> strings = Arrays.asList( input.replaceAll("^.*?(?i)pattern1", "") .split("(?i)pattern2.*?(pattern1|$)"));` – Bohemian Jan 14 '14 at 13:50
13

Try this:

String str = "its a string with pattern1 aleatory pattern2 things between pattern1 and pattern2 and sometimes pattern1 pattern2 nothing";
Matcher m = Pattern.compile(
        Pattern.quote("pattern1")
        + "(.*?)"
        + Pattern.quote("pattern2")
).matcher(str);
while (m.find()) {
    String match = m.group(1);
    System.out.println(">" + match + "<");
    // here you insert 'match' into the list
}

It prints:

> aleatory <
> and <
> <
elias
  • What if I want pattern 1 and pattern 2 to be included in the output? – R11G Nov 11 '16 at 02:29
  • @R11G you can simply concat the pattern variables in the output, or move the parentheses to include the patterns: `"("+ Pattern.quote(pat1) + ".*?" + Pattern.quote(pat2) + ")"` and grab it by `m.group(1)`. – elias Nov 11 '16 at 12:31