
I want to tokenize a text, but not by separating only on whitespace.

There are some things, like proper names, that I want to keep as a single token (eg.: "Renato Dinhani Conceição"). Another case: a percentage ("60 %") should not be split into two tokens.

What I want to know is whether there is a tokenizer in some library that provides high customization. If not, I will try to write my own; is there some interface or set of practices to follow?

Not everything needs universal recognition. Example: I don't need to recognize the Chinese alphabet.

My application is a college application and it is mainly aimed at the Portuguese language. Only some things, like names and places, will be from other languages.

Renato Dinhani
  • If you manage to solve this problem in a simple programmatic way, I'm pretty sure you'll have an ACM Turing Award waiting for you. – fluffy Jul 28 '11 at 18:34
  • Indeed, it seems that this is trying to do something with a static algorithm that really requires some sort of knowledge base and AI. Actually, you can already see this is not going to happen since there is no such thing as an algorithm for solving an open question. – Maarten Bodewes Jul 28 '11 at 18:51
  • Proper name detection, aka Named Entity Recognition, is an active field of research. This is not tokenization, this is full-blown NLP. – Fred Foo Jul 29 '11 at 09:37
  • To learn more about NLP/Information Extraction, please take a look at this post - http://stackoverflow.com/questions/573620/how-to-get-started-on-information-extraction - and to try out existing methods in NLP, I would advise using the nltk toolkit and reading the corresponding book (available on the nltk page) -- http://www.nltk.org/. – Skarab Jul 29 '11 at 10:33

4 Answers


I would go about it not from a tokenization perspective, but from a rules perspective. The biggest challenge will be creating a comprehensive rule set that satisfies most of your cases.

  • Define, in human terms, which units should not be split on whitespace. The name example is one.
  • For each of those exceptions to the whitespace split, create a set of rules for identifying it. For the name example: two or more consecutive capitalized words, with or without language-specific non-capitalized name words in between (like "de").
  • Implement each rule as its own class that can be called as you loop.
  • Split the entire string on whitespace, then loop over the result, keeping track of the previous and current token and applying your rule classes to each token.

Example for rule isName:

  • Loop 1: (eg.: → isName = false
  • Loop 2: "Renato → isName = true
  • Loop 3: Dinhani → isName = true
  • Loop 4: Conceição"). → isName = true
  • Loop 5: Another → isName = false

Leaving you with: (eg.:, "Renato Dinhani Conceição")., Another
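A minimal sketch of that loop in Java, under stated assumptions: the isNameWord rule (capitalized first letter, or a connective such as "de"/"da") and the convention that a trailing sentence terminator closes a name run are both hypothetical choices for this sketch, and only runs of two or more candidate words are merged, per the "2 or more consecutive capitalized words" rule above:

```java
import java.util.ArrayList;
import java.util.List;

public class RuleTokenizer {
    // Hypothetical rule: the word is a name candidate if its letters start
    // with an upper-case character, or it is a connective like "de"/"da".
    static boolean isNameWord(String word) {
        String letters = word.replaceAll("[^\\p{L}]", ""); // strip punctuation
        if (letters.isEmpty()) return false;
        return Character.isUpperCase(letters.charAt(0))
                || letters.equals("de") || letters.equals("da");
    }

    // A word ending in a sentence terminator closes the current name run.
    static boolean endsSentence(String word) {
        return word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
    }

    // Merge a run only if it has 2+ words; otherwise emit them individually,
    // so a lone capitalized word (e.g. sentence-initial) stays a single token.
    private static void flush(List<String> run, List<String> tokens) {
        if (run.size() >= 2) {
            tokens.add(String.join(" ", run));
        } else {
            tokens.addAll(run);
        }
        run.clear();
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (isNameWord(word)) {
                run.add(word);
                if (endsSentence(word)) flush(run, tokens); // run ended
            } else {
                flush(run, tokens); // non-name word ends any pending run
                tokens.add(word);
            }
        }
        flush(run, tokens);
        return tokens;
    }
}
```

Note the limits of such heuristics: any capitalized word at the start of a sentence that happens to sit next to another capitalized word will be merged, which is exactly where the NER suggestions in the comments come in.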

atrain
  • I liked this idea. Two questions about it: are there any isSomething() checks implemented in some library? Is it better to do the validations with regex or by running through the String char by char? – Renato Dinhani Jul 28 '11 at 22:50
  • Been searching, haven't found one yet. Also note that even if a lib exists, you might still need (probably will need) to implement your own rules. – atrain Jul 29 '11 at 12:30
  • He could just use a Named Entity Recognizer first, then treat every identified entity as one token and everything that isn't an entity as separate tokens. This moves away from rules, automates the identification of proper nouns, and all that's left is a regular expression. – dmn Jul 29 '11 at 21:08
  • Good call - I hadn't thought of NLP. Here's a Stack Overflow page on NER: http://stackoverflow.com/questions/188176/named-entity-recognition-libraries-for-java. OpenCalais was the selected option. Nice! – atrain Jul 29 '11 at 21:19

I think that a tokenizer is going to be too simplistic for what you want. One step up from a tokenizer would be a lexer like JFlex. These split a stream of characters into separate tokens like a tokenizer, but with much more flexible rules.

Even so, it seems like you're going to need some sort of natural language processing, as teaching a lexer the difference between a proper name and normal words might be tricky. You might be able to get pretty far by teaching it that a string of words that start with upper-case letters all belong together, numbers may be followed by units, etc. Good luck.
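As a rough illustration of those two heuristics (capitalized runs belong together; numbers may be followed by a unit), here is a single regular expression with ordered alternatives. This is a sketch, not a real lexer: the connective list ("de|da|do") and the percent-only unit rule are assumptions chosen for the Portuguese examples in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLexer {
    // Ordered alternatives: (1) a run of capitalized words, optionally
    // joined by assumed connectives "de|da|do"; (2) a number followed by
    // "%" as its unit; (3) any non-space run as the fallback token.
    private static final Pattern TOKEN = Pattern.compile(
          "\\p{Lu}\\p{L}*(?:\\s+(?:de|da|do)\\s+\\p{Lu}\\p{L}*|\\s+\\p{Lu}\\p{L}*)+"
        + "|\\d+(?:[.,]\\d+)?\\s*%"
        + "|\\S+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group()); // each find() yields one token
        }
        return tokens;
    }
}
```

The ordering matters: because the capitalized-run and number-with-unit alternatives come before the `\S+` fallback, the regex engine tries to build the longer multi-word tokens first and only falls back to splitting on whitespace.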

CarlG

You should try Apache OpenNLP. It includes ready-to-use sentence detector and tokenizer models for Portuguese.

Download Apache OpenNLP and extract it, then copy the Portuguese tokenizer model into the OpenNLP folder. The model can be downloaded from http://opennlp.sourceforge.net/models-1.5/

Using it from command line:

bin/opennlp TokenizerME pt-token.bin 
Loading Tokenizer model ... done (0,156s)
O José da Silva chegou, está na sua sala.
O José da Silva chegou , está na sua sala .

Using the API:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// load the model (declared outside the try block so it stays in scope)
TokenizerModel model = null;
InputStream modelIn = null;

try {
  modelIn = new FileInputStream("pt-token.bin");
  model = new TokenizerModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // ignore failure to close the stream
    }
  }
}

// create the tokenizer
Tokenizer tokenizer = new TokenizerME(model);

// tokenize your sentence
String[] tokens = tokenizer.tokenize("O José da Silva chegou, está na sua sala.");
wcolen

StringTokenizer is a legacy class that is retained only for backward compatibility. Its use is discouraged in new code.

You should use the String.split() method instead. The split method takes a regular expression as its argument. Additionally, you can go further by using the Pattern and Matcher classes: compile your pattern objects once and reuse them to match various scenarios.
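A small sketch of that suggestion: compiling the pattern once with Pattern and reusing it avoids the regex recompilation that String.split() typically performs on every call. The whitespace-only pattern here is just a starting point, not a solution to the name and percentage cases:

```java
import java.util.regex.Pattern;

public class SplitExample {
    // Compile once and reuse across many inputs; String.split() would
    // typically recompile this regex on each call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String[] words(String text) {
        return WHITESPACE.split(text);
    }
}
```

As the comments below note, a plain split like this cannot keep "Renato Dinhani Conceição" together; it only covers the simple whitespace case.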

Basanth Roy
  • Human names are not regular, so it's not possible to identify a full name as a token using this method. Names were a particular use case identified in the original question. – Thomas Owens Jul 28 '11 at 18:30
  • I doubt the poster of this question was asking for a universal way to split all his strings. What he was asking sounded more like a best practice in this scenario. – Basanth Roy Jul 28 '11 at 18:35
  • That could be. I interpreted the question as looking for some uberpowerful framework for string processing. I suppose it depends on how you read the question. – Thomas Owens Jul 28 '11 at 18:37
  • The two examples I posted are the first things I saw in my first tests, but new situations can occur. – Renato Dinhani Jul 28 '11 at 18:38