
I'm trying to capture two sub-sentences of arbitrary length separated by a specific word ("AND" in the example), where the second one is optional. Some examples:

CASE1:

foo sentence A AND foo sentence B

shall give

"foo sentence A" --> matching group 1

"AND" --> matching group 2 (optionally)

"foo sentence B" --> matching group 3

CASE2:

foo sentence A

shall give

"foo sentence A" --> matching group 1
"" --> matching group 2 (optionally)
"" --> matching group 3

I tried the following regex

(.*) (AND (.*))?$

and it works, but in CASE2 only if I put a space at the end of the string; otherwise the pattern doesn't match. If I move the space before "AND" inside the optional parenthesized group, then in CASE1 the matcher puts the whole string in the first group. I wondered about lookahead and lookbehind assertions, but I'm not sure they can help me. Any suggestions? Thanks

martin.p

5 Answers


How about just using

String split[] = sentence.split("AND");

That will split the sentence on your word and give you an array of subparts.
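A minimal sketch of the split approach, assuming the separator should match only as a standalone word and that the surrounding whitespace should be dropped. The class name `SplitOnWord` and the helper `splitOnAnd` are made up for illustration; the `\b` word boundaries and the limit of 2 are additions that the one-liner above does not have.

```java
import java.util.Arrays;

class SplitOnWord {
    // Splits on the standalone word AND and swallows the whitespace around it.
    // The limit of 2 keeps any later AND inside the second part.
    static String[] splitOnAnd(String sentence) {
        return sentence.split("\\s*\\bAND\\b\\s*", 2);
    }

    public static void main(String[] args) {
        // Two parts: the sentences on either side of AND.
        System.out.println(Arrays.toString(splitOnAnd("foo sentence A AND foo sentence B")));
        // The AND inside SANDWICHES is not a separator because of the \b boundaries.
        System.out.println(Arrays.toString(splitOnAnd("SANDWICHES ARE TASTY AND I LIKE KITTENS")));
        // No separator: a one-element array holding the whole sentence.
        System.out.println(Arrays.toString(splitOnAnd("foo sentence A")));
    }
}
```

Without the separator the array has length 1, so the caller still has to treat the second part as missing.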

greedybuddha
  • This is a way but the result shall not be stored in an array. Thanks anyway. – martin.p May 26 '13 at 13:21
  • Do you mean you don't want the results stored in an array? Because using split returns an array. – greedybuddha May 26 '13 at 14:54
  • Exactly. Right now I'm managing with groups stored in the matcher. – martin.p May 26 '13 at 16:21
  • Would you explain how this would handle a string like `SANDWICHES ARE TASTY AND I LIKE KITTENS` – Ro Yo Mi May 26 '13 at 16:31
  • It would split it into two subparts. split[0] == "SANDWICHES ARE TASTY " and split[1] == " I LIKE KITTENS". The .split method also takes regular expressions so you can make it include the whitespace or different cases. I just wanted to make the example very clear. – greedybuddha May 26 '13 at 16:32

Description

This regex returns the requested string parts in the requested groups. The and is optional; if it's not found in the string, the entire string is placed into group 1. The \s*? and \b tokens are there to keep the whitespace around the separator out of the captured groups; as the Case 1 output shows, group 1 comes out trimmed, while group 3 still keeps its leading space.

^\s*?\b(.*?)\b\s*?(?:\b(and)\b\s*?\b(.*?)\b\s*?)?$


Groups

0 gets the entire matching string

  1. gets the string before the separating word and; if there is no and, the entire string appears here
  2. gets the separating word, in this case it's and
  3. gets the second part of the string

Java Code Example:

Case 1

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "foo sentence A AND foo sentence B";
    Pattern re = Pattern.compile("^\\s*?\\b(.*?)\\b\\s*?(?:\\b(and)\\b\\s*?\\b(.*?)\\b\\s*?)?$", Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    if (m.find()) {
      // Group 0 is the whole match; groups 1..groupCount() are the captures.
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + groupIdx + "] = " + m.group(groupIdx));
      }
    }
  }
}

Output:
[0] = foo sentence A AND foo sentence B
[1] = foo sentence A
[2] = AND
[3] =  foo sentence B

Case 2, using the same regex

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "foo sentence A";
    Pattern re = Pattern.compile("^\\s*?\\b(.*?)\\b\\s*?(?:\\b(and)\\b\\s*?\\b(.*?)\\b\\s*?)?$", Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    if (m.find()) {
      // Groups 2 and 3 do not take part in this match, so group() returns null for them.
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + groupIdx + "] = " + m.group(groupIdx));
      }
    }
  }
}

Output:
[0] = foo sentence A
[1] = foo sentence A
[2] = null
[3] = null
Ro Yo Mi
  • Thanks, it works! I see you used (?: .... ) but its meaning isn't clear to me. Is there a well-written tutorial about this? – martin.p May 26 '13 at 13:35
  • (?: starts a non-capturing group. This allows the ? at the end to make the group optional without placing the matched text into a returned group – Ro Yo Mi May 26 '13 at 13:38
  • I found also this: http://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups – martin.p May 27 '13 at 15:42
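To illustrate the non-capturing group discussed in the comments above, here is a small sketch comparing the group numbering of two simplified patterns. The patterns and the class name `NonCapturingDemo` are made up for the comparison and are not the exact regex from the answer.

```java
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Returns how many capturing groups a pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Capturing wrapper: the optional part itself takes a group number.
        System.out.println(groupCount("(.*?)( (AND) (.*))?"));   // prints 4
        // Non-capturing wrapper: still optional, but it gets no number.
        System.out.println(groupCount("(.*?)(?: (AND) (.*))?")); // prints 3
    }
}
```

With the non-capturing wrapper the separator stays in group 2 and the second sentence in group 3, which is the numbering the question asks for.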

I'd use this regex:

^(.*?)(?: (AND) (.*))?$

explanation:

The regular expression:

(?-imsx:^(.*?)(?: (AND) (.*))?$)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      AND                      'AND'
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
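A possible way to use this regex from Java, sketched under the assumption that unmatched groups should come back as empty strings, as CASE2 in the question asks. The class name `OptionalAndDemo` and the helper `parse` are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalAndDemo {
    private static final Pattern SENTENCES = Pattern.compile("^(.*?)(?: (AND) (.*))?$");

    // Returns {group 1, group 2, group 3}; groups that did not take part
    // in the match (null in Java) are replaced by "".
    static String[] parse(String sentence) {
        String[] out = {"", "", ""};
        Matcher m = SENTENCES.matcher(sentence);
        if (m.matches()) {
            for (int i = 1; i <= 3; i++) {
                String g = m.group(i);
                out[i - 1] = (g == null) ? "" : g;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("foo sentence A AND foo sentence B")));
        System.out.println(Arrays.toString(parse("foo sentence A")));
    }
}
```

Because `.` does not match a line break by default, input containing a newline does not match here and yields three empty strings.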
Toto

Change your regex to make the space after the first sentence optional, and make the first group lazy so that it does not swallow the AND part:

^(.*?\\S) ?(AND (.*))?$

Or you could use split() to consume the AND and any surrounding spaces:

String[] sentences = sentence.split("\\s*AND\\s*");
Bohemian

Your case 2 is a little strange...

But I would do

String[] parts = sentence.split("(?<=AND)|(?=AND)");

Then check parts.length. If the length is 1, it is case 2: the array holds only the sentence, and you can add empty strings as your groups 2 and 3.

In case 1 you get the parts directly:

[foo sentence A , AND,  foo sentence B]
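The length check described above could be sketched as follows. The class name `LookaroundSplitDemo` and the helper `parts` are illustrative; trimming the pieces and passing a limit of 3 are additions the answer does not mention. Note that this pattern also splits inside words that contain AND, such as SANDWICHES.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Always returns three entries: first part, separator, second part.
    static String[] parts(String sentence) {
        // Zero-width split points just before and just after "AND";
        // the limit of 3 keeps any later AND inside the last part.
        String[] raw = sentence.split("(?<=AND)|(?=AND)", 3);
        String[] out = {"", "", ""};
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i].trim();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("foo sentence A AND foo sentence B")));
        System.out.println(Arrays.toString(parts("foo sentence A")));
    }
}
```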
Kent