I want to tokenize a text, but not by splitting only on whitespace.
Some things, like proper names, should be kept as a single token (e.g., "Renato Dinhani Conceição"). Another case: a percentage ("60 %") should stay one token and not be split into two.
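To make the desired behavior concrete, here is a rough sketch in Python using only the standard `re` module. The names `TOKEN_RE` and `tokenize` and the pattern itself are my own assumptions for illustration, not something taken from an existing library. The "proper name" rule is a naive heuristic (two or more consecutive capitalized words), so it would also merge a capitalized sentence-initial word with a name that follows it, and it does not handle particles such as "da" or "de" inside names.

```python
import re

# Hypothetical pattern: alternatives are tried in order, most specific first.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)?[ ]?%                                    # percentage: "60 %", "12,5%"
  | [A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+(?:[ ][A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+)+  # 2+ capitalized words in a row
  | \w+                                                     # any other word or number
  | [^\w\s]                                                 # a single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of `text`; whitespace between tokens is dropped."""
    return TOKEN_RE.findall(text)


print(tokenize("Renato Dinhani Conceição pagou 60 % da conta."))
# ['Renato Dinhani Conceição', 'pagou', '60 %', 'da', 'conta', '.']
```

The point is only to show the kind of rules I would like to plug in: each special case is one alternative in the pattern, and the generic word rule comes last.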
What I want to know is whether there is a tokenizer in some library that allows a high degree of customization. If not, I will try to write my own; in that case, is there an interface or set of practices I should follow?
Not everything needs to be recognized universally. For example, I don't need to recognize Chinese characters.
My application is a college project and is aimed mainly at the Portuguese language. Only some things, like names, places, and similar items, will come from other languages.