
I want to tokenize a sentence that contains run-together words, like the following:

"This is a samplestring that Iwanttotokenize."

In the above example there are two cases, "samplestring" and "Iwanttotokenize", where adjacent words have been joined. Any idea how to split these into separate tokens?

For this sentence, the ideal output would be the tokens: This, is, a, sample, string, that, I, want, to, tokenize.

Imran
    How would you differentiate between words that are slammed together versus words that legitimately contain two words? – Dave Newton Jan 21 '13 at 19:43
  • What I'm saying is that without a means of determining what's valid/invalid this is impossible. You would need to contextually analyze surrounding text to determine if a compound word was "valid", or when a word didn't exist in the dictionary, decide *how* to break it up into individual words, which might again be context-dependent. – Dave Newton Jan 21 '13 at 20:00
  • You need some rules. How will you distinguish if you should split "without" to "with out" or not? What is your use case? If it is a general use-case you'll need natural language processing tools. This is a huuuuge topic. Start here http://stackoverflow.com/questions/870460/is-there-a-good-natural-language-processing-library – Piotr Gwiazda Jan 21 '13 at 20:13
  • You should do it probabilistically: e.g. "without" will appear more often in normal text than "with out" – kutschkem Jan 21 '13 at 20:30
  • Twitter is a particularly hard case. You are not even dealing with a fixed language, but must also take into account recognising other languages, acronyms/initialisms, spelling mistakes and made up words which will be impossible to parse because you will have no background. It would be practical to 'spell-check' each one, but there are many occasions where the first result of even a very good spell checker is not correct. And even a good spellchecker probably can't deal with "Iwanttotokenize" – Philip Whitehouse Jan 21 '13 at 21:09
  • Oh and just to head you off - doing anything more than a % chance for each word (which will be wrong a large fraction of the time) means you store n(n-1) probabilities. – Philip Whitehouse Jan 21 '13 at 21:13
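The frequency-based idea raised in the comments can be sketched as follows: score every possible segmentation of a token by the product of its word probabilities and keep the best one. The `COUNTS` table and `TOTAL` below are made-up illustrative numbers, not real corpus data; in practice the counts would come from a large text corpus.

```java
import java.util.*;

public class ProbSplit {
    // Hypothetical unigram counts standing in for real corpus statistics.
    static final Map<String, Integer> COUNTS = new HashMap<>();
    static {
        COUNTS.put("with", 500);
        COUNTS.put("out", 300);
        COUNTS.put("without", 4000);
    }
    static final int TOTAL = 100000;

    // Log-probability of one word; unseen words get a heavy penalty.
    static double logProb(String w) {
        return Math.log(COUNTS.getOrDefault(w, 1) / (double) TOTAL);
    }

    // Return the segmentation with the highest total log-probability.
    // Simple exponential recursion; memoize for anything but short tokens.
    static List<String> bestSplit(String s) {
        if (s.isEmpty()) return new ArrayList<>();
        double bestScore = Double.NEGATIVE_INFINITY;
        List<String> bestSeg = null;
        for (int i = 1; i <= s.length(); i++) {
            String head = s.substring(0, i);
            List<String> rest = bestSplit(s.substring(i));
            double score = logProb(head);
            for (String w : rest) score += logProb(w);
            if (score > bestScore) {
                bestScore = score;
                bestSeg = new ArrayList<>(rest);
                bestSeg.add(0, head);
            }
        }
        return bestSeg;
    }

    public static void main(String[] args) {
        // "without" stays whole because P(without) > P(with) * P(out).
        System.out.println(bestSplit("without"));
    }
}
```

With these toy counts, "without" is kept as one word because its single-word probability beats the product of the probabilities of "with" and "out", which is exactly the distinction kutschkem's comment is after.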

1 Answer


I'd suggest using a word list like the one at http://www.sil.org/linguistics/wordlists/english. If memory allows, load it into a HashSet and use contains(), which is fast thanks to hash-based lookup.

First, tokenize the string using StringTokenizer. For each token, check whether it starts and/or ends with a word from the list. If it starts and ends with words from the list and no letters are left over, insert spaces into the original string at the appropriate positions and tokenize again.
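A rough sketch of this approach, using a tiny hard-coded word set in place of the SIL list (the `DICT` contents are placeholder assumptions, and the recursion backtracks when a greedy prefix match leaves an unsplittable remainder):

```java
import java.util.*;

public class WordSplitter {
    // Toy dictionary standing in for a full English word list.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "this", "is", "a", "sample", "string", "that",
            "i", "want", "to", "tokenize"));

    // Try to cover the whole token with dictionary words,
    // preferring the longest prefix first; null means no full cover exists.
    static List<String> split(String token) {
        if (token.isEmpty()) return new ArrayList<>();
        for (int i = token.length(); i > 0; i--) {
            String head = token.substring(0, i);
            if (DICT.contains(head.toLowerCase())) {
                List<String> rest = split(token.substring(i));
                if (rest != null) {
                    rest.add(0, head);
                    return rest;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String sentence = "This is a samplestring that Iwanttotokenize.";
        StringTokenizer st = new StringTokenizer(sentence, " .,!?");
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            List<String> parts = split(tok);
            // Keep the token unchanged if it cannot be covered by known words.
            out.addAll(parts != null ? parts : Collections.singletonList(tok));
        }
        System.out.println(String.join(" ", out));
    }
}
```

As the comments point out, a plain dictionary cannot decide between "without" and "with out"; this sketch only shows the mechanical splitting step.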

ratlan