Rather than splitting, you should prefer find
to match the tokens you want directly, using this regex:
[a-zA-Z]+(['][a-zA-Z]+)?
This regex matches a run of letters, optionally followed by a single ' and another run of letters, so the ' must be sandwiched between letters. To allow other such characters, add them to the character set [']
As written, the group can occur at most once per token. To allow it to repeat, replace the ? at the end with a *
so that it matches zero or more times.
Check out your modified Java code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> tokenList = new ArrayList<>();
String str = "..Hello ?don't #$you %know?";
Pattern p = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");
Matcher m = p.matcher(str);
// Each successful find() advances to the next token in the string.
while (m.find()) {
    tokenList.add(m.group());
}
String[] strArray = tokenList.toArray(new String[0]);
for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
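The two variations mentioned earlier (adding a character to the set and changing ? to *) can be sketched as follows. This is only an illustration: the class name TokenVariants, the helper tokens, the hyphen added to the set, and the sample input are my own assumptions, not part of your code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenVariants {
    // Hypothetical helper: collects every match of the given regex, in order.
    public static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "mother-in-law's rock'n'roll";

        // Original pattern: only ' is allowed inside a token, and only once.
        System.out.println(tokens("[a-zA-Z]+(['][a-zA-Z]+)?", input));
        // prints [mother, in, law's, rock'n, roll]

        // Assumed variation: hyphen added to the set, and ? changed to *.
        System.out.println(tokens("[a-zA-Z]+(['-][a-zA-Z]+)*", input));
        // prints [mother-in-law's, rock'n'roll]
    }
}
```

Placing the hyphen last inside the character set keeps it a literal hyphen rather than a range operator.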
However, if you insist on using split
only, you can split the string with this regex:
[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+
This splits the string either on one or more whitespace characters, optionally surrounded by non-letter characters, or on a run of one or more characters that are neither letters nor single quotes. (The regex is shown as it appears in a Java string literal, hence the doubled backslash in \\s.) Here is sample Java code using split:
import java.util.Arrays;

String str = ".. Hello ?don't #$you %know?";
// Split, then drop any zero-length tokens.
String[] strArray = Arrays.stream(str.split("[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+"))
        .filter(x -> !x.isEmpty())
        .toArray(String[]::new);
for (int i = 0; i < strArray.length; i++) {
    System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
Note that I have used the filter method on the stream to drop zero-length tokens, because split produces an empty first element when the string starts with a delimiter (trailing empty strings are already discarded by split).
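To see why the filter is needed, here is a small sketch comparing the raw result of split with the filtered one. The class name SplitLeadingEmpty and the helpers rawSplit and tokens are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

public class SplitLeadingEmpty {
    // Same delimiter regex as in the split example.
    static final String DELIM = "[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+";

    // Hypothetical helper: plain split, no filtering.
    public static String[] rawSplit(String s) {
        return s.split(DELIM);
    }

    // Hypothetical helper: split, then drop zero-length tokens.
    public static String[] tokens(String s) {
        return Arrays.stream(s.split(DELIM))
                .filter(x -> !x.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String str = ".. Hello ?don't #$you %know?";

        // The string starts with a delimiter, so the first element is empty.
        System.out.println(Arrays.toString(rawSplit(str)));
        // prints [, Hello, don't, you, know]

        System.out.println(Arrays.toString(tokens(str)));
        // prints [Hello, don't, you, know]
    }
}
```

The trailing ? does not produce an empty element at the end, because split with the default limit removes trailing empty strings.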