I'm trying to split a string into "sentences" but I'm having an issue with trailing words. For example:

"This isn't cool. This doesn't work. This"

should split into

[This isn't cool., This doesn't work., This]

So far I've been using "[^\\.!?]*[\\.\\s!?]+", but I can't figure out how to adjust it for the trailing word: it has no terminating character, so there is nothing to match on. Is there something I can add, or do I need to change the pattern completely?

Veer Singh
cjcamisa
  • I'm on mobile and can't test this, but you should try adding the end of string meta character `$` to your second series of characters. – tblznbits Nov 01 '15 at 18:02
  • I won't vote to close it as duplicate of [Split string into sentences based on periods](http://stackoverflow.com/questions/2687012/split-string-into-sentences-based-on-periods) since your title explicitly states that you want to use regex, but consider using other tools for described problem. – Pshemo Nov 01 '15 at 18:12

3 Answers

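import java.util.Arrays;
// Splits on a period followed by a space; the separator itself is dropped.
String s = "This isn't cool. This doesn't work. This";
System.out.println(Arrays.toString(s.split("\\. ")));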

Produces:

[This isn't cool, This doesn't work, This]
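Note that the split above drops the periods, while the question's expected output keeps them. As a possible variation (not part of the original answer), a lookbehind can be used so the split happens on the whitespace that follows a terminator while the terminator itself stays with the sentence. The class and method names here are only illustrative:

```java
import java.util.Arrays;

class SentenceSplit {
    // Split on whitespace that is preceded by '.', '!' or '?'.
    // The lookbehind (?<=...) checks the terminator without consuming it,
    // so it remains part of the sentence before the split point.
    static String[] splitSentences(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String s = "This isn't cool. This doesn't work. This";
        System.out.println(Arrays.toString(splitSentences(s)));
        // prints [This isn't cool., This doesn't work., This]
    }
}
```

This treats every '.', '!' or '?' followed by whitespace as a sentence boundary, so abbreviations such as "e.g. this" would also be split.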
kukis

Instead of splitting the string, you can find all the sentences. To match the trailing sentence, use the anchor $, which matches the end of the string:

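import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> sentences = new ArrayList<String>();
// A run of non-terminators, then either a terminator or the end of the string.
Matcher m = Pattern.compile("[^?!.]+(?:[.?!]|$)")
    .matcher("This isn't cool. This doesn't work. This");
while (m.find()) {
  sentences.add(m.group());  // later matches keep their leading space
}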
Mazdak

You can safely change the last + to a * as well.

Regex quantifiers are greedy by default, so each part of the pattern grabs as much of the input as it can. That means that the first subexpression will match

This isn't cool

and the next part will match the period and the space, and nothing more. Changing the plus to an asterisk does not change this behavior. Inside the string, all sentence-ending characters still get matched, and at the end there is nothing left to match, which is valid with a *.
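A minimal sketch of that change, assuming the pattern from the question is used with Matcher.find() in a loop. Two details here are my own additions rather than part of the answer: because both parts are now optional, the pattern can also match the empty string once the input is exhausted, so empty matches are skipped; and each match is trimmed because it includes the whitespace after the terminator. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVariant {
    // The question's pattern with the final + changed to *.
    static final Pattern SENTENCE = Pattern.compile("[^\\.!?]*[\\.\\s!?]*");

    static List<String> findSentences(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            // Both quantifiers allow zero characters, so find() also reports
            // an empty match at the end of the input; ignore it.
            if (!m.group().isEmpty()) {
                // The match includes trailing whitespace; trim it off.
                out.add(m.group().trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findSentences("This isn't cool. This doesn't work. This"));
        // prints [This isn't cool., This doesn't work., This]
    }
}
```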

Jongware