I am looking for some libraries which would help me do the following:
For a given input text document:
1. Convert the document to lower case (easy; solved with the toLowerCase function)
2. Remove symbols
3. Tokenize, resulting in a list of words
E.g.: "A,B; C\nD. F" should result in ["a", "b", "c", "d", "f"].
It should work with all languages. I have some Russian, Chinese, and Japanese text in addition to English.
Here is what I have tried:
The solution mentioned in "Replacing all non-alphanumeric characters with empty strings" could easily be adapted to my problem if I were dealing only with English.
java.util.StringTokenizer kind of works, but it does not remove symbols.
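To make the two attempts concrete, here is a minimal sketch of what each one does on the example input. The class and method names are hypothetical, and the ASCII-only pattern is my own adaptation of the regex idea, not the exact code from the linked question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

class TokenizeAttempts {
    // Attempt 1 (assumed adaptation): lower-case, then split on runs of
    // anything that is not an ASCII letter or digit.
    static List<String> asciiRegexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip the empty leading token split() can produce
            }
        }
        return tokens;
    }

    // Attempt 2: StringTokenizer with its default whitespace delimiters.
    static List<String> stringTokenizerTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text.toLowerCase(Locale.ROOT));
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "A,B; C\nD. F";
        System.out.println(asciiRegexTokens(sample));        // [a, b, c, d, f]
        System.out.println(stringTokenizerTokens(sample));   // [a,b;, c, d., f]
        System.out.println(asciiRegexTokens("Привет, мир")); // []
    }
}
```

The regex version handles the English sample but throws away every Russian letter, while StringTokenizer keeps the punctuation attached to the words.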
Here is what I am looking for: an elegant way to perform all three of these operations. I am not looking for elaborate (i.e. lengthy) code that does it (I can write that myself if there is no elegant solution).