General tokenizer

Question

I am looking for some libraries which would help me do the following:

For a given input text document:

1. Convert the document to lower case (easy; solved with the toLowerCase function).
2. Remove symbols.
3. Tokenize, resulting in a list of words.

E.g., "A,B; C\nD. F" should result in ["a", "b", "c", "d", "f"].

It should work with all languages. I have some Russian, Chinese, and Japanese text in addition to English.

Here is what I have tried:

The solution mentioned in "Replacing all non-alphanumeric characters with empty strings" could easily be adapted to my problem if I were dealing only with English.

java.util.StringTokenizer kind of works, but it does not remove symbols.

Here is what I am looking for: an elegant way to perform all three of these operations. I am not looking for elaborate (i.e. lengthy) code that does it (I can write that myself if there is no elegant solution).

FDinoff · Accepted Answer · 2013-04-09T20:19:03.190

Have you tried using String.split() with a regex that uses symbols and whitespace as delimiters?

Something along these lines:

document.toLowerCase().split("[\\p{Punct}\\s]+");

where \p{Punct} matches any one of the ASCII punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ according to the java.util.regex.Pattern documentation.

This will remove all symbols and white space and return an array of strings that would be your tokenised list without any symbols.
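A minimal runnable sketch of the idea, in case it helps. The class and method names (Tokenizer, tokenize, tokenizeUnicode) are made up for illustration. The second method is not part of the suggestion above: it assumes you want to treat everything that is not a Unicode letter or digit as a delimiter, since \p{Punct} by default covers only ASCII punctuation and would leave symbols such as « » or 。 inside the tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class Tokenizer {

    // Hypothetical helper: lower-case, then split on runs of ASCII punctuation and whitespace.
    public static List<String> tokenize(String document) {
        return Arrays.stream(document.toLowerCase(Locale.ROOT).split("[\\p{Punct}\\s]+"))
                // split() returns a leading "" when the text starts with a delimiter
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Assumed variant: split on anything that is not a Unicode letter (\p{L}) or number (\p{N}).
    public static List<String> tokenizeUnicode(String document) {
        return Arrays.stream(document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A,B; C\nD. F"));               // [a, b, c, d, f]
        System.out.println(tokenizeUnicode("Привет, мир! «Hello»")); // [привет, мир, hello]
    }
}
```

Note that Chinese and Japanese text usually has no whitespace between words, so any split-based approach returns whole runs of characters rather than individual words. For those languages a dictionary-based word segmenter would be needed; java.text.BreakIterator.getWordInstance may be worth a look, though I have not checked how well it handles them.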

edited Apr 09 '13 at 20:19

answered Apr 09 '13 at 20:02


@ElKamina did you include the `\\s` and `+` in the regex? The `+` should match 1 or more characters that are part of the set. Note: `\\s` matches all white space characters – FDinoff Apr 09 '13 at 20:33
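To illustrate the point about `+`, here is a small sketch (the class and method names are hypothetical). Without the quantifier every delimiter character splits on its own, so two adjacent delimiters such as "; " leave an empty string in the result.

```java
import java.util.Arrays;

public class PlusQuantifierDemo {

    // Each delimiter character is a separate split point.
    public static String[] withoutPlus(String text) {
        return text.split("[\\p{Punct}\\s]");
    }

    // A run of one or more delimiter characters is a single split point.
    public static String[] withPlus(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "a,b; c";
        System.out.println(Arrays.toString(withoutPlus(text))); // [a, b, , c]
        System.out.println(Arrays.toString(withPlus(text)));    // [a, b, c]
    }
}
```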
