-1

I need to split some sentences into words.

For example:

Upper sentence.
Lower sentence. And some text.

I do it by:

String[] words = text.split("(\\s+|[^.]+$)");

But the output I get is:

Upper, sentence.Lower, sentence., And, some, text.

And it should be like:

Upper, sentence., Lower, sentence., And, some, text.

Notice that I need to preserve all the characters (.,-?! etc.)

candylady

5 Answers

5

In regular expressions, \W+ matches one or more non-word characters.

http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

So if you want to get the words in the sentences you can use \W+ as the splitter.

String[] words = text.split("\\W+");

This will give you the following output:

Upper
sentence
Lower
sentence
And
some
text

UPDATE : Since you have updated your question, if you want to preserve all characters and split by spaces, use \s+ as the splitter.

String[] words = text.split("\\s+");

I have checked the following code block and confirmed that it also works with newlines.

String text = "Upper sentence.\n" +
        "Lower sentence. And some text.";
String[] words = text.split("\\s+");
for (String word : words) {
    System.out.println(word);
}
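To make the difference between the two splitters concrete, here is a minimal, self-contained sketch that runs both on the text from the question. The class and method names (SplitComparison, splitOnNonWord, splitOnWhitespace) are only illustrative and are not part of the original code.

```java
import java.util.Arrays;

public class SplitComparison {
    // Splits on runs of non-word characters, so punctuation is discarded.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W+");
    }

    // Splits on runs of whitespace (spaces, tabs, newlines),
    // so punctuation stays attached to the neighbouring word.
    static String[] splitOnWhitespace(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Upper sentence.\nLower sentence. And some text.";
        // [Upper, sentence, Lower, sentence, And, some, text]
        System.out.println(Arrays.toString(splitOnNonWord(text)));
        // [Upper, sentence., Lower, sentence., And, some, text.]
        System.out.println(Arrays.toString(splitOnWhitespace(text)));
    }
}
```

Only the \s+ version keeps the dots, which is what the updated question asks for.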
Chathura Buddhika
1

The expression \\s+ means "1 or more whitespace characters". I think what you need to do is replace this by \\s*, which means "zero or more whitespace characters".

Jeroen Steenbeeke
1

You can split the string into sub strings using the following line of code:

String[] result = speech.split("\\s");

For reference: https://alvinalexander.com/java/edu/pj/pj010006
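One thing to watch with a single \s: each whitespace character is treated as its own delimiter, so consecutive spaces produce empty strings in the result. The sketch below compares it with \s+; the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class SingleVsRepeatedWhitespace {
    // Every single whitespace character is a delimiter.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // A whole run of whitespace counts as one delimiter.
    static String[] splitRun(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String speech = "this   is";  // three spaces between the words
        // [this, , , is]  (two empty strings between the words)
        System.out.println(Arrays.toString(splitSingle(speech)));
        // [this, is]
        System.out.println(Arrays.toString(splitRun(speech)));
    }
}
```

If the input may contain repeated spaces or mixed spaces and newlines, \s+ is usually the safer choice.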

Clijsters
Sreejesh K Nair
1

Replace dots, commas, etc. with a space, then split the result on whitespace:

String text = "hello.world this   is.a sentence.";
String[] list = text.replaceAll("\\.", " " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));

Result: [hello, world, this, is, a, sentence]

Edit:

If it is only for dots, this trick should work:

String text = "hello.world this   is.a sentence.";
String[] list = text.replaceAll("\\.", ". " ).split("\\s+");
System.out.println(new ArrayList<>(Arrays.asList(list)));

[hello., world, this, is., a, sentence.]
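If other punctuation marks need the same treatment, one possible variation (not part of the trick above, just a sketch) is to skip the replaceAll step and split with a lookaround instead: split on whitespace, or at the zero-width position where a punctuation mark is directly followed by a word character. The class name, method name, and the character set [.,!?] are assumptions chosen for illustration.

```java
import java.util.Arrays;

public class PunctuationPreservingSplit {
    // Delimiter is either a run of whitespace, or the empty position
    // between one of . , ! ? and a word character that follows it.
    private static final String DELIMITER = "\\s+|(?<=[.,!?])(?=\\w)";

    static String[] split(String text) {
        return text.split(DELIMITER);
    }

    public static void main(String[] args) {
        String text = "hello.world this   is.a sentence.";
        // [hello., world, this, is., a, sentence.]
        System.out.println(Arrays.toString(split(text)));
    }
}
```

Note that this also splits inside things like decimal numbers ("3.14" becomes "3." and "14") and abbreviations, so it only fits text where punctuation always ends a word.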

FranAguiar
1

Simple answer for updated question

    String text = "Upper sentence.\n"+
            "Lower sentence. And some text.";

The pattern below matches one or more spaces, OR one or more newlines:

    String[] arr1 = text.split("[ ]+|\n+");
    System.out.println(Arrays.toString(arr1));

result:

 [Upper, sentence., Lower, sentence., And, some, text.]
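One caveat, shown in the sketch below: because spaces and newlines are separate alternatives, a space immediately followed by a newline counts as two delimiters and leaves an empty string in the array. Using \s+ treats the whole run as one delimiter. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class SpaceOrNewlineSplit {
    // Runs of spaces and runs of newlines are matched separately.
    static String[] splitSpaceOrNewline(String text) {
        return text.split("[ ]+|\n+");
    }

    // Any mixed run of whitespace is a single delimiter.
    static String[] splitWhitespace(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        // Note the trailing space before the newline.
        String text = "Upper sentence. \nLower";
        // [Upper, sentence., , Lower]
        System.out.println(Arrays.toString(splitSpaceOrNewline(text)));
        // [Upper, sentence., Lower]
        System.out.println(Arrays.toString(splitWhitespace(text)));
    }
}
```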
Micah