I'm using Python with nltk. I need to process some English text that contains no whitespace, but nltk's word_tokenize function can't handle input like this. How can I tokenize text that has no whitespace? Is there any tool in Python for this?
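A minimal illustration of what I mean (the input string is just a made-up example):

```python
from nltk.tokenize import word_tokenize
# (requires the 'punkt' resource: run nltk.download('punkt') once beforehand)

# word_tokenize splits on whitespace and punctuation, so a run of
# letters with no spaces comes back as a single token.
print(word_tokenize("theyouthevent"))  # ['theyouthevent']
```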
- Why are there no spaces? What is the domain? – Jared Jul 14 '13 at 06:43
- How do you identify a word? – Burhan Khalid Jul 14 '13 at 06:56
- Unless you're scanning the text letter by letter and testing all possible combinations of contiguous characters, there has to be a delimiter – Yotam Jul 14 '13 at 06:56
- This is an interesting algorithm problem! I don't know why it's being downvoted. – picomancer Jul 14 '13 at 07:55
- I saw the same problem [here](https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436). Hope it can help you. – Nambi Jun 19 '19 at 19:09
- Does this answer your question? [How can I split multiple joined words?](https://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words) – polm23 Apr 10 '22 at 05:31
1 Answer
I am not aware of a ready-made tool, but the solution to your problem depends on the language.
For Turkish, you can scan the input text letter by letter and accumulate the letters in a buffer. As soon as the buffer forms a valid dictionary word, you save it as a separate token, clear the buffer, and continue the process.
You can try the same for English, but you will run into situations where the ending of one word is also the beginning of another dictionary word (for example, a greedy scan of "theyouthevent" takes "they" first and then gets stuck on "outhevent"), so the scanner needs some way to backtrack.
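As a minimal sketch of this idea (assuming a plain Python set as the dictionary; you could also build one from nltk.corpus.words), a dynamic-programming scan avoids the greedy dead end by tracking every position where a valid segmentation can end:

```python
def segment(text, dictionary):
    """Split text into dictionary words; return None if impossible."""
    # best[i] holds a valid segmentation of text[:i], or None if none exists.
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in dictionary:
                best[i] = best[j] + [text[j:i]]
                break
    return best[len(text)]

# Hypothetical toy dictionary: a greedy scan would take "they" and then
# fail on "outhevent", but the DP recovers "the" + "youth" + "event".
words = {"the", "they", "out", "youth", "event"}
print(segment("theyouthevent", words))  # ['the', 'youth', 'event']
```

Note that without word frequencies this only finds some valid segmentation, not necessarily the most plausible one; for real text you would want to score candidate words, e.g. by corpus frequency.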

Ivan Mushketyk