1

I can't find the exact way to solve this issue I'm having. I want to split a sentence that contains spaces and may contain punctuation marks. I want to keep both the words and the punctuation marks and store them in a single array.

Example sentence:
We have not met, have we?

Desired array:
{"We", "have", "not", "met", ",", "have", "we", "?"}

I'm trying to split the sentence with a single String split call. I've looked through other related questions on Stack Overflow and I haven't been able to find a regex that works for me, especially for the question mark.

Alex Conroy
  • 2
    http://stackoverflow.com/questions/2206378/how-to-split-a-string-but-also-keep-the-delimiters – Reimeus May 04 '16 at 22:40
  • 1
    @Alex Conroy Try and look around and see if someone's asked a similar question first. There's more than a few that cover this, like the above and http://stackoverflow.com/questions/3777546/how-can-i-split-a-string-in-java-and-retain-the-delimiters – Tibrogargan May 04 '16 at 22:41
  • Thanks for the link, @Tibrogargan. I actually looked up that question beforehand, but it didn't work for me. I tweaked the solution(s) from that question and it worked with everything except for the question mark; I was receiving error messages for the question mark. – Alex Conroy May 04 '16 at 23:31

2 Answers

2

You may try splitting with whitespaces or at the locations before non-word characters:

\s+|(?=\W)

See the regex demo

Pattern details: \s+|(?=\W) contains two alternatives separated by the | symbol. \s+ matches one or more whitespace characters, which are removed when splitting. (?=\W) is a positive lookahead that matches only the empty string (a zero-width position) right before the pattern it contains - here, \W matches any non-word character (anything that is not a letter, digit, or underscore). Because the lookahead consumes nothing, the punctuation mark itself is kept in the result.
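
To make the effect of the lookahead concrete, here is a minimal sketch that contrasts a split that consumes the comma with one that only looks ahead at it. The class and method names (LookaheadSplitDemo, splitConsuming, splitLookahead) are made up for this illustration, and the patterns are narrowed to whitespace and commas to keep the contrast small.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class LookaheadSplitDemo {

    // The comma is part of the match, so split() discards it along with the whitespace.
    static String[] splitConsuming(String s) {
        return s.split("[\\s,]+");
    }

    // The lookahead matches the zero-width position before the comma,
    // so the comma stays in the input and ends up as its own element.
    static String[] splitLookahead(String s) {
        return s.split("\\s+|(?=,)");
    }

    public static void main(String[] args) {
        String s = "met, have";
        System.out.println(Arrays.toString(splitConsuming(s))); // => [met, have]
        System.out.println(Arrays.toString(splitLookahead(s))); // => [met, ,, have]
    }
}
```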

NOTE: If the non-word class \W is too broad for you (it also matches symbols such as + or $), you may use the punctuation class \p{P} instead (String pattern = "\\s+|(?=\\p{P})") to split only before punctuation.
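
The difference between the two classes shows up on input that contains symbols which are not punctuation. The following is only a sketch: the class name, method names, and the sample sentence are assumptions chosen for illustration. It relies on + being a math symbol rather than a member of the Unicode punctuation category that \p{P} stands for.

```java
import java.util.Arrays;

// Hypothetical demo class; names and sample input are illustrative only.
class PunctuationSplitDemo {

    // Splits on whitespace, or before any non-word character (including '+').
    static String[] splitBeforeNonWord(String s) {
        return s.split("\\s+|(?=\\W)");
    }

    // Splits on whitespace, or before Unicode punctuation only ('+' is not punctuation).
    static String[] splitBeforePunctuation(String s) {
        return s.split("\\s+|(?=\\p{P})");
    }

    public static void main(String[] args) {
        String s = "C++ rocks, right?";
        System.out.println(Arrays.toString(splitBeforeNonWord(s)));
        // => [C, +, +, rocks, ,, right, ?]
        System.out.println(Arrays.toString(splitBeforePunctuation(s)));
        // => [C++, rocks, ,, right, ?]
    }
}
```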

IDEONE Java demo:

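import java.util.Arrays;

String str = "We have not met, have we?";
// Split on runs of whitespace, or at the zero-width position before a non-word character
String[] chunks = str.split("\\s+|(?=\\W)");
System.out.println(Arrays.toString(chunks));
// => [We, have, not, met, ,, have, we, ?]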

If you need to tokenize the non-whitespace/non-word chunks as whole units (say, ?!! as one array element), use this matching technique:

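import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// [^\s\W]+ matches a run of word characters; \S+ matches any other run of non-whitespace
Pattern ptrn = Pattern.compile("[^\\s\\W]+|\\S+");
Matcher m = ptrn.matcher("We have not met, have we?!!");
List<String> list = new ArrayList<>();
while (m.find()) {
    list.add(m.group(0));
}
System.out.println(list); // => [We, have, not, met, ,, have, we, ?!!]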
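
If the matching approach is needed in more than one place, the loop above can be wrapped in a small helper. This is a sketch only: MatchTokenizer and tokenize are hypothetical names, and the pattern is the one from the snippet above, compiled once and reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class MatchTokenizer {

    // A run of word characters, or otherwise a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("[^\\s\\W]+|\\S+");

    // Collects every match in order; whitespace is never matched, so it is skipped.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("We have not met , have we ?!!"));
        // => [We, have, not, met, ,, have, we, ?!!]
    }
}
```

Since only tokens are matched and whitespace is skipped, extra spaces (including a space in front of a punctuation mark) do not add empty elements to the list.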

See another IDEONE demo and a regex demo.

Wiktor Stribiżew
  • 1
    Thanks for the simple solution and making it clear. Also cheers for the regex demo link, that'll be a life saver. – Alex Conroy May 04 '16 at 23:23
0
String sentence="We have not met, have we ?";
String[] splited = sentence.split("\\s+");
suulisin