
I want to tokenize a sentence that contains run-together words, like the following:

"This is a samplestring that Iwanttotokenize."

In the above example there are two cases, "samplestring" and "Iwanttotokenize", where adjacent words have been joined. Any idea how to split these into separate tokens?

For this sentence, the ideal output would be the tokens: This, is, a, sample, string, that, I, want, to, tokenize.

Imran
    How would you differentiate between words that are slammed together versus words that legitimately contain two words? – Dave Newton Jan 21 '13 at 19:43
  • What I'm saying is that without a means of determining what's valid/invalid this is impossible. You would need to contextually analyze surrounding text to determine if a compound word was "valid", or when a word didn't exist in the dictionary, decide *how* to break it up into individual words, which might again be context-dependent. – Dave Newton Jan 21 '13 at 20:00
  • You need some rules. How will you distinguish if you should split "without" to "with out" or not? What is your use case? If it is a general use-case you'll need natural language processing tools. This is a huuuuge topic. Start here http://stackoverflow.com/questions/870460/is-there-a-good-natural-language-processing-library – Piotr Gwiazda Jan 21 '13 at 20:13
  • You should do it probabilistically: e.g. "without" will appear more often in normal text than "with out" – kutschkem Jan 21 '13 at 20:30
  • Twitter is a particularly hard case. You are not even dealing with a fixed language, but must also take into account recognising other languages, acronyms/initialisms, spelling mistakes and made up words which will be impossible to parse because you will have no background. It would be practical to 'spell-check' each one, but there are many occasions where the first result of even a very good spell checker is not correct. And even a good spellchecker probably can't deal with "Iwanttotokenize" – Philip Whitehouse Jan 21 '13 at 21:09
  • Oh and just to head you off - doing anything more than a % chance for each word (which will be wrong a large fraction of the time) means you store n(n-1) probabilities. – Philip Whitehouse Jan 21 '13 at 21:13
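The frequency-based idea raised in the comments can be sketched as follows: score every possible segmentation of a token by the product of its word probabilities and keep the best one. The `COUNTS` table and `TOTAL` below are made-up illustrative numbers, not real corpus data; in practice the counts would come from a large text corpus.

```java
import java.util.*;

public class ProbSplit {
    // Hypothetical unigram counts standing in for real corpus statistics.
    static final Map<String, Integer> COUNTS = new HashMap<>();
    static {
        COUNTS.put("with", 500);
        COUNTS.put("out", 300);
        COUNTS.put("without", 4000);
    }
    static final int TOTAL = 100000;

    // Log-probability of one word; unseen words get a heavy penalty.
    static double logProb(String w) {
        return Math.log(COUNTS.getOrDefault(w, 1) / (double) TOTAL);
    }

    // Return the segmentation with the highest total log-probability.
    // Simple exponential recursion; memoize for anything but short tokens.
    static List<String> bestSplit(String s) {
        if (s.isEmpty()) return new ArrayList<>();
        double bestScore = Double.NEGATIVE_INFINITY;
        List<String> bestSeg = null;
        for (int i = 1; i <= s.length(); i++) {
            String head = s.substring(0, i);
            List<String> rest = bestSplit(s.substring(i));
            double score = logProb(head);
            for (String w : rest) score += logProb(w);
            if (score > bestScore) {
                bestScore = score;
                bestSeg = new ArrayList<>(rest);
                bestSeg.add(0, head);
            }
        }
        return bestSeg;
    }

    public static void main(String[] args) {
        // "without" stays whole because P(without) > P(with) * P(out).
        System.out.println(bestSplit("without"));
    }
}
```

With these toy counts, "without" is kept as one word because its single-word probability beats the product of the probabilities of "with" and "out", which is exactly the distinction kutschkem's comment is after.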

1 Answer


I'd suggest using a word list like the one at http://www.sil.org/linguistics/wordlists/english. If memory allows, load it into a HashSet and use contains(), which is fast thanks to hash-based lookup.

First, tokenize the string using StringTokenizer. For each token, check whether it starts and/or ends with a word from the list. If it starts and ends with words from the list and no letters are left over, insert spaces into the original string at the appropriate positions and tokenize again.
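A rough sketch of this approach, using a tiny hard-coded word set in place of the SIL list (the `DICT` contents are placeholder assumptions, and the recursion backtracks when a greedy prefix match leaves an unsplittable remainder):

```java
import java.util.*;

public class WordSplitter {
    // Toy dictionary standing in for a full English word list.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "this", "is", "a", "sample", "string", "that",
            "i", "want", "to", "tokenize"));

    // Try to cover the whole token with dictionary words,
    // preferring the longest prefix first; null means no full cover exists.
    static List<String> split(String token) {
        if (token.isEmpty()) return new ArrayList<>();
        for (int i = token.length(); i > 0; i--) {
            String head = token.substring(0, i);
            if (DICT.contains(head.toLowerCase())) {
                List<String> rest = split(token.substring(i));
                if (rest != null) {
                    rest.add(0, head);
                    return rest;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String sentence = "This is a samplestring that Iwanttotokenize.";
        StringTokenizer st = new StringTokenizer(sentence, " .,!?");
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            List<String> parts = split(tok);
            // Keep the token unchanged if it cannot be covered by known words.
            out.addAll(parts != null ? parts : Collections.singletonList(tok));
        }
        System.out.println(String.join(" ", out));
    }
}
```

As the comments point out, a plain dictionary cannot decide between "without" and "with out"; this sketch only shows the mechanical splitting step.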

ratlan