I'm trying to tokenize a string input, but I can't get my head around how to do it. The idea is to split the string into alphabetical words and non-alphabetical symbols. For example, the string "Test, ( abc)" would be split into ["Test", ",", "(", "abc", ")"].

Right now I use this regular expression: "(?<=[a-zA-Z])(?=[^a-zA-Z])", but it doesn't do what I want.

Any ideas what else I could use?

4 Answers

As I understand it, you want runs of letters (like Test and abc) grouped into one token, while each non-alphabetical character stays a separate token, and you do not want spaces in the result. For this I would match with "(\\w+|\\W)" after removing all spaces from the input string. Note that \w also matches digits and underscores, not only letters.

Sample code

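import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "Test, ( abc)";
str = str.replaceAll(" ", ""); // drop spaces so they are not returned as tokens
// \w+ matches a run of word characters; \W matches one non-word character
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group());
}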

Output

Test
,
(
abc
)

I hope this answers your question.

Nikhil Yekhe
  • He modified the code after I posted. Earlier his code did not handle the non-alphabetical characters (his regex was only \\w+, which is not a correct answer). The question was not about the code itself but about the regex; the code only illustrates the regex. I hope this clarifies, @GCP. – Nikhil Yekhe Dec 26 '17 at 18:05

Try this:

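import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "I want to walk my dog, and why not?";
// \w+ matches a run of word characters; \W matches one non-word character
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group());
}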

Outputs:

I
want
to
walk
my
dog
,
and
why
not
?

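\w matches word characters ([A-Za-z0-9_]) and \W matches any single non-word character, so each punctuation mark comes out as its own token. Note that \W also matches whitespace, so every space is matched as well (printed as a line containing a single space, not shown above). To skip spaces, strip them first or use [^\w\s] in place of \W.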
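
If digits and underscores should not count as part of a word, a letters-only pattern may be closer to what the question asks for. The following is only a sketch under that assumption: the class name LetterTokenizer and the method tokenize are made up for this example, and "letter" here means ASCII a-z and A-Z, as in the question's own regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class LetterTokenizer {
    // Either a run of ASCII letters, or one character that is
    // neither a letter nor whitespace (so spaces are skipped).
    private static final Pattern TOKEN =
        Pattern.compile("[a-zA-Z]+|[^a-zA-Z\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group()); // tokens are added in input order
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Test, ( abc)")); // [Test, ,, (, abc, )]
    }
}
```

With this pattern a digit is treated like any other symbol, so "a1" would come out as two tokens, a and 1.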

(Taken from: here)

GuyKhmel

Try this:

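import java.util.ArrayList;

public static ArrayList<String> res(String a) {
    // Split on whitespace first so spaces never show up as tokens.
    String[] tokens = a.split("\\s+");
    ArrayList<String> strs = new ArrayList<>();
    for (String token : tokens) {
        String[] alpha = token.split("\\W+");    // runs of word characters
        String[] nonAlpha = token.split("\\w+"); // runs of non-word characters
        // Caveat: each chunk's word runs are added before its symbol runs,
        // so "(abc)" yields abc, (, ) rather than (, abc, ).
        for (String str : alpha) {
            if (!str.isEmpty()) strs.add(str);
        }
        for (String str : nonAlpha) {
            if (!str.isEmpty()) strs.add(str);
        }
    }
    return strs;
}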
mehdi maick

I guess in the simplest form, split using

"(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+"

Explained:

    (?<= [a-zA-Z] )               # letter behind
    (?= [^\sa-zA-Z] )             # not letter/whitespace ahead
 |                                # or,
    (?<= [^\sa-zA-Z] )            # not letter/whitespace behind
    (?= [a-zA-Z] )                # letter ahead
 |                                # or,
    \s+                           # whitespace (discarded)
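
A quick way to try the split pattern above is to pass it to String.split. This is a minimal sketch; the class name SplitDemo and the method tokenize are made up for the example, and the pattern itself is copied unchanged.

```java
import java.util.Arrays;

// Hypothetical demo class, named only for this example.
class SplitDemo {
    // The split pattern from above, unchanged.
    static final String BOUNDARY =
        "(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+";

    static String[] tokenize(String input) {
        // Splits at letter/symbol boundaries and drops whitespace runs.
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Test, ( abc)"))); // [Test, ,, (, abc, )]
        System.out.println(Arrays.toString(tokenize("f(())")));        // [f, (())]
    }
}
```

As the second call shows, adjacent symbols stay together in one token, because the pattern only splits where a letter meets a symbol or at whitespace. If every symbol should be its own token, matching tokens with a Matcher may be the easier route.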