You may try splitting on whitespace or at the positions just before non-word characters:
\s+|(?=\W)
See the regex demo
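Pattern details: \s+|(?=\W) contains two alternatives separated with the | symbol.
\s+ matches 1 or more whitespace characters; these are consumed by the split, so they do not appear in the result.
(?=\W) is a positive lookahead: it matches the empty string at any position that is immediately followed by the pattern it contains. Here that pattern is \W, which matches any non-word character (anything other than a letter, digit, or underscore). Because the lookahead consumes nothing, the non-word character itself stays in the output.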
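To make the role of the lookahead concrete, here is a minimal sketch that contrasts it with a consuming \W+ alternative. The class name LookaheadDemo, the two helper methods, and the sample string are made up for illustration; only String.split and Arrays.toString come from the JDK.

```java
import java.util.Arrays;

// Hypothetical demo class; names and sample input are illustrative only.
class LookaheadDemo {
    // Consuming variant: whitespace and non-word characters are both
    // swallowed by the delimiter, so punctuation is lost.
    static String[] splitConsuming(String s) {
        return s.split("\\s+|\\W+");
    }

    // Lookahead variant from the answer: the split happens *before* a
    // non-word character, which therefore survives as part of a token.
    static String[] splitKeeping(String s) {
        return s.split("\\s+|(?=\\W)");
    }

    public static void main(String[] args) {
        String s = "Hi, you?";
        System.out.println(Arrays.toString(splitConsuming(s))); // [Hi, you]
        System.out.println(Arrays.toString(splitKeeping(s)));   // [Hi, ,, you, ?]
    }
}
```

The only difference between the two patterns is whether the non-word character is part of the match; that alone decides whether it is kept.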
NOTE: If the non-word class \W is too broad for you, you may use the Unicode punctuation class \p{P} instead (String pattern = "\\s+|(?=\\p{P})";) to split only before punctuation.
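As a rough illustration of the difference, the sketch below runs both patterns on a string containing +, which is a non-word character but belongs to the Unicode math-symbol category rather than to punctuation. The class name PunctuationSplitDemo, the helper names, and the input are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical demo class; names and sample input are illustrative only.
class PunctuationSplitDemo {
    // Splits on whitespace or before ANY non-word character.
    static String[] splitBeforeNonWord(String s) {
        return s.split("\\s+|(?=\\W)");
    }

    // Splits on whitespace or before Unicode punctuation (\p{P}) only.
    static String[] splitBeforePunctuation(String s) {
        return s.split("\\s+|(?=\\p{P})");
    }

    public static void main(String[] args) {
        String s = "x+y, done.";
        // '+' is a non-word character, so \W splits in front of it.
        System.out.println(Arrays.toString(splitBeforeNonWord(s)));     // [x, +y, ,, done, .]
        // '+' is a math symbol, not punctuation, so x+y stays together.
        System.out.println(Arrays.toString(splitBeforePunctuation(s))); // [x+y, ,, done, .]
    }
}
```

Note that either pattern splits only before the character, so a symbol that directly precedes a word (as in +y above) stays attached to that word.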
IDEONE Java demo:
// Requires: import java.util.Arrays;
String str = "We have not met, have we?";
String[] chunks = str.split("\\s+|(?=\\W)");
System.out.println(Arrays.toString(chunks));
// => [We, have, not, met, ,, have, we, ?]
If you need to keep runs of non-whitespace, non-word characters together as whole units (say, ?!! as one array element), match the tokens instead of splitting:
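// Requires: import java.util.*; import java.util.regex.*;
// [^\s\W]+ is a run of word characters (same as \w+); \S+ is any other run of non-whitespace.
Pattern ptrn = Pattern.compile("[^\\s\\W]+|\\S+");
Matcher m = ptrn.matcher("We have not met, have we?!!");
List<String> list = new ArrayList<>();
while (m.find()) {
    list.add(m.group(0));
}
System.out.println(list); // => [We, have, not, met, ,, have, we, ?!!]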
See another IDEONE demo and a regex demo.
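The matching approach can be wrapped in a small helper, sketched below. One thing to watch for: the \S+ alternative takes everything up to the next whitespace, so punctuation that is directly followed by a word (no space in between) pulls that word into the same token. The STRICT pattern is an assumed variant, not part of the answer above, that limits the second alternative to non-word, non-whitespace characters. The class name TokenizeDemo, the constant names, and the tokenize helper are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample inputs are illustrative only.
class TokenizeDemo {
    // Pattern from the answer: a run of word characters,
    // or otherwise any run of non-whitespace characters.
    static final Pattern ORIGINAL = Pattern.compile("[^\\s\\W]+|\\S+");

    // Assumed variant (not from the answer): a run of word characters,
    // or a run of characters that are neither word nor whitespace.
    static final Pattern STRICT = Pattern.compile("\\w+|[^\\w\\s]+");

    // Collects every successive match of the pattern as one token.
    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(ORIGINAL, "We have not met, have we?!!"));
        // [We, have, not, met, ,, have, we, ?!!]
        System.out.println(tokenize(ORIGINAL, "Wait,what?!!")); // [Wait, ,what?!!]
        System.out.println(tokenize(STRICT, "Wait,what?!!"));   // [Wait, ,, what, ?!!]
    }
}
```

If the input always has whitespace after punctuation, the two patterns should give the same tokens; the variant only matters when punctuation and words touch.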