
What regex pattern do I need to pass to the String.split() method to split a string into an array of substrings, using whitespace as well as the following characters as delimiters: ("!", ",", "?", ".", "\", "_", "@", "'")? A delimiter can also be any of the above characters combined with whitespace. I've tried something like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.*;

class StringWordCount {
    public static void main(String[] args) throws IOException {
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        String string = bufferedReader.readLine();
        String delimiter = "[,\\s]+|\\[!\\s]+|\\[?\\s]+|\\[.\\s]+|\\[_\\s]+|\\[_\\s]+|\\['\\s]+|\\[@\\s]+|\\!|\\,|\\?|\\.|\\_|\\'|\\@";
        String[] words = string.split(delimiter);
        System.out.println(words.length);
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
    }
}

The above code generates correct output only for some test cases; in others it does not. For example, consider the string below, for which it failed to produce the expected output:

Hello, thanks for attempting this problem! Hope it will help you to learn java! Good luck and have a nice day!

It generates the output:

23
Hello
thanks
for
attempting
this
problem

Hope
it
will
help
you
to
learn
java

Good
luck
and
have
a
nice
day

Instead of this one:

21
Hello
thanks
for
attempting
this
problem
Hope
it
will
help
you
to
learn
java
Good
luck
and
have
a
nice
day

As you can see in the first output, it's leaving a blank entry wherever "!" is followed by a space, and the delimiter for that combination is \\[!\\s], right?

akash
Batman25663
  • Possible duplicate of [How to split a string in Java](http://stackoverflow.com/questions/3481828/how-to-split-a-string-in-java) – Tushar Dec 17 '15 at 06:19
  • @Tushar and others: The question you're calling this a "duplicate" of was posted by someone who didn't know about `split()`. This questioner knows about `split` and is having trouble getting the delimiter right. This is not a duplicate. – ajb Dec 17 '15 at 06:39
  • StringTokenizer is more suitable for the given scenario, though it has been superseded by Scanner and the split method. – Tech Enthusiast Dec 17 '15 at 07:39

3 Answers


You can try this one:

// Requires: import java.util.Arrays;
String str = "Hello, thanks for attempting this problem! Hope it will help you to learn java! Good luck and have a nice day!";
// Earlier attempt: String[] split = str.split("[\\p{Punct}\\s+]");
// Split on runs of punctuation (\p{Punct}) and spaces or tabs (\p{Blank}).
String[] split = str.split("[\\p{Punct}\\p{Blank}]+");
System.out.println("Arrays.toString(split) = " + Arrays.toString(split));

Result is:

Arrays.toString(split) = [Hello, thanks, for, attempting, this, problem, Hope, it, will, help, you, to, learn, java, Good, luck, and, have, a, nice, day]

Edited: the split call was changed to the line below:

String[] split = str.split("[\\p{Punct}\\p{Blank}]+");
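One thing worth checking before adopting this pattern: \p{Punct} covers all ASCII punctuation, which is broader than the delimiter list in the question, and \p{Blank} covers only spaces and tabs. The sketch below compares it with a character class that lists only the question's delimiters. The class name, method names and the explicit character class are illustrative assumptions, not part of the answer above.

```java
import java.util.Arrays;

class PunctVersusExplicit {
    // Pattern from the answer: any ASCII punctuation, space or tab.
    static String[] splitPunct(String input) {
        return input.split("[\\p{Punct}\\p{Blank}]+");
    }

    // Assumed alternative: whitespace plus only the delimiters named in the
    // question (! , ? . \ _ ' @).
    static String[] splitListed(String input) {
        return input.split("[\\s!,?._'@\\\\]+");
    }

    public static void main(String[] args) {
        String input = "well-known (maybe)";
        // '-' and the parentheses are punctuation, so they act as delimiters here.
        System.out.println(Arrays.toString(splitPunct(input)));   // [well, known, maybe]
        // Only the space is a listed delimiter, so hyphen and parentheses survive.
        System.out.println(Arrays.toString(splitListed(input)));  // [well-known, (maybe)]
    }
}
```

Which of the two is right depends on whether characters such as "-" or "(" should count as part of a word.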
Bahramdun Adil

In this line:

String delimiter = "[,\\s]+|\\[!\\s]+|\\[?\\s]+|\\[.\\s]+|\\[_\\s]+|\\[_\\s]+|\\['\\s]+|\\[@\\s]+|\\!|\\,|\\?|\\.|\\_|\\'|\\@";

you have \\[ in the string literal, which means the pattern has two characters \[ in it. In the pattern matcher, this causes the matcher to look for the [ character. This isn't what you want.

When a \ character appears in a pattern string:

  1. If the following character is a letter or digit, the combination has some special meaning (for example, you're using \s in the string to mean whitespace), but:
  2. If the following character is something other than a letter or a digit, this means to treat the following character as itself. Any special meaning the character may have had is canceled.

It looks like you're trying to use [!\s]+ (in the pattern; of course you had to double the backslash in the string literal) to match one or more characters in the set of ! and whitespace. Here, [ and ] have a special meaning, to match any character in a set. But putting \ before the [ cancels the special meaning of [, and causes the matcher to look for a [ in the input, which it doesn't find.
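The difference can be checked directly with a small sketch. The class and method names are made up for illustration; the two patterns are the escaped and unescaped forms discussed above.

```java
import java.util.regex.Pattern;

class EscapedBracketDemo {
    // Returns true if the whole input matches the given regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Pattern text \[!\s]+ : a literal '[', then '!', one whitespace
        // character, then one or more literal ']'.
        String escaped = "\\[!\\s]+";
        // Pattern text [!\s]+ : one or more characters that are '!' or whitespace.
        String charClass = "[!\\s]+";

        System.out.println(matchesWhole(escaped, "! "));     // false
        System.out.println(matchesWhole(escaped, "[! ]"));   // true
        System.out.println(matchesWhole(charClass, "! "));   // true
    }
}
```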

See this javadoc for more information.

I'm not sure, but I think getting rid of all the \\ before each [ will make things work. The pattern will still be more complicated than necessary (and I'm not 100% clear on what the requirements are, so it's hard for me to suggest an improvement).
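As one possible simplification, assuming the requirement is "split on any run of whitespace and the listed characters", the alternation can be collapsed into a single character class. This is only a sketch: the class name, the DELIMITERS constant and the words method are hypothetical, and the test string is the one quoted in another answer on this page.

```java
import java.util.Arrays;

class SplitOnDelimiters {
    // Assumed pattern: whitespace plus the delimiters from the question
    // (! , ? . \ _ ' @), repeated one or more times. Inside a character
    // class none of these need escaping except the backslash itself.
    static final String DELIMITERS = "[\\s!,?._'@\\\\]+";

    static String[] words(String input) {
        return input.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String input = "Hello, thanks for attempting this problem! Hope it will help you to learn java! Good luck and have a nice day!";
        String[] words = words(input);
        System.out.println(words.length);               // 21
        System.out.println(Arrays.toString(words));
        // A run that mixes different delimiters is consumed as one match.
        System.out.println(Arrays.toString(words("Really?! Yes")));  // [Really, Yes]
    }
}
```

A single class treats a mixed run such as "?! " as one delimiter, whereas an alternation of per-character classes can match it in two pieces and leave an empty string between them. Note that split still returns a leading empty string when the input itself starts with a delimiter.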

ajb
  • Thanks @ajb. Sorry for not specifying the exact requirements, as I've just started working with Java. I should have read the javadoc before attempting this problem. Getting rid of the '\\' before each '[' works for all the possible test cases. Thanks again. :) – Batman25663 Dec 17 '15 at 07:39

Just do matching instead of splitting:

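// Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
// s is the input string. Note that \w also matches '_', so underscores stay inside words.
List<String> lst = new ArrayList<>();
Matcher m = Pattern.compile("\\w+").matcher(s);
while (m.find()) {
    lst.add(m.group());
}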
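Because \w includes the underscore, \w+ does not treat "_" as a delimiter, which the question asks for. A hedged variant of the same matching idea uses a negated character class instead. The class name, the WORD constant and the words method are illustrative, and the delimiter set is assumed from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchWords {
    // Assumption: a "word" is any run of characters that are neither
    // whitespace nor one of the delimiters listed in the question.
    private static final Pattern WORD = Pattern.compile("[^\\s!,?._'@\\\\]+");

    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("snake_case, ok!"));  // [snake, case, ok]
        System.out.println(words("  !leading"));       // [leading]
    }
}
```

Matching words rather than splitting on delimiters never yields empty strings, even when the input starts with a delimiter.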
Avinash Raj