I am looking for some libraries which would help me do the following:

For a given input text document:

1. Convert the document to lower case (easy; solved with the toLowerCase function)
2. Remove symbols
3. Tokenize, resulting in a list of words

E.g.: "A,B; C\nD. F" should result in ["a", "b", "c", "d", "f"].

It should work with all languages. I have some Russian, Chinese, and Japanese text in addition to English.

Here is what I have tried:

The solution mentioned in "Replacing all non-alphanumeric characters with empty strings" could easily be adapted to my problem if I were dealing only with English.

java.util.StringTokenizer kind of works but it will not remove symbols.

Here is what I am looking for: an elegant way to perform all three operations. I am not looking for elaborate (i.e. lengthy) code that does it (I can write that myself if there is no elegant solution).

ElKamina

1 Answer

Have you tried using String.split() with a regex that uses symbols and whitespace as delimiters?

Something along these lines:

document.toLowerCase().split("[\\p{Punct}\\s]+");

where \p{Punct} matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ according to the Pattern documentation. By default this class covers ASCII punctuation only.

This splits on every run of punctuation and whitespace and returns an array of strings: your tokenised list, without any of those symbols.
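
As a rough sketch of how this could be wrapped up, the class below applies the split above and also tries a Unicode-aware variant for the Russian, Chinese, and Japanese text mentioned in the question. The class name `Tokenizer`, the method `tokenize`, and the choice to treat everything except letters (`\p{L}`) and digits (`\p{N}`) as a delimiter are assumptions made for illustration, not part of the original answer. The non-ASCII sample is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

final class Tokenizer {
    // The pattern from the answer: by default \p{Punct} and \s are ASCII-only.
    static final Pattern ASCII_DELIMS = Pattern.compile("[\\p{Punct}\\s]+");

    // Assumption: any run of characters that are neither letters nor digits
    // (in any script) counts as a delimiter.
    static final Pattern UNICODE_DELIMS = Pattern.compile("[^\\p{L}\\p{N}]+");

    static String[] tokenize(String document, Pattern delims) {
        // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish dotless i).
        String lower = document.toLowerCase(Locale.ROOT);
        // split() yields a leading empty string when the text starts with a
        // delimiter, so empty tokens are filtered out.
        return Arrays.stream(delims.split(lower))
                     .filter(token -> !token.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The example from the question.
        System.out.println(Arrays.toString(tokenize("A,B; C\nD. F", ASCII_DELIMS)));
        // prints [a, b, c, d, f]

        // Russian "Privet, mir!" followed by Chinese "ni hao, shijie." with
        // full-width punctuation, which the ASCII-only pattern would not split on.
        String mixed = "\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440! "
                     + "\u4F60\u597D\uFF0C\u4E16\u754C\u3002";
        System.out.println(Arrays.toString(tokenize(mixed, UNICODE_DELIMS)));
    }
}
```

Chinese and Japanese do not separate words with spaces, so a delimiter-based split only returns the runs of text between punctuation marks. Real word segmentation for those languages would need something like `java.text.BreakIterator.getWordInstance` or a dedicated library.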

FDinoff
@ElKamina Did you include the `\\s` and `+` in the regex? The `+` matches one or more consecutive characters from the set. Note: `\\s` matches all whitespace characters. – FDinoff Apr 09 '13 at 20:33
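
To illustrate why the `+` matters, here is a small sketch comparing the two patterns on a string where a comma is followed by a space. The class and method names (`PlusQuantifierDemo`, `splitWithoutPlus`, `splitWithPlus`) are made up for this example.

```java
import java.util.Arrays;

final class PlusQuantifierDemo {
    // Without '+', each delimiter character is matched on its own, so two
    // adjacent delimiters leave an empty string between them.
    static String[] splitWithoutPlus(String s) {
        return s.split("[\\p{Punct}\\s]");
    }

    // With '+', a whole run of delimiter characters is consumed as one match.
    static String[] splitWithPlus(String s) {
        return s.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "a, b";
        System.out.println(Arrays.toString(splitWithoutPlus(text))); // [a, , b]
        System.out.println(Arrays.toString(splitWithPlus(text)));    // [a, b]
    }
}
```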