I'm working on a little hobby Python project that involves building dictionaries for various languages from large bodies of text written in those languages. For most languages this is relatively straightforward: I can split on the space delimiter between words to tokenize a paragraph into words for the dictionary. Chinese, however, does not use a space character between words. How can I tokenize a paragraph of Chinese text into words?
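Here is roughly what I'm doing now (a simplified sketch; the Chinese sentence is just an illustrative example of where it breaks down):

```python
import re
from collections import Counter

def tokenize(paragraph: str) -> list[str]:
    # Pull out runs of word characters -- this effectively splits on spaces
    # and punctuation, which is fine for most space-delimited languages.
    return [w.lower() for w in re.findall(r"\w+", paragraph)]

counts = Counter()
counts.update(tokenize("The quick brown fox jumps over the lazy dog."))
# counts now has individual English words.

# But CJK characters are all "word characters" with no spaces between words,
# so an entire Chinese clause comes back as a single token:
counts.update(tokenize("我明天要去北京"))
# -> one token: "我明天要去北京", not the individual words.
```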
My searching has found that this is a somewhat complex problem, so I'm wondering if there are off-the-shelf solutions in Python, or elsewhere via an API, or in any other language. This must be a common problem, because any search engine built for Asian languages would need to overcome this issue to provide relevant results.
I tried searching around on Google, but I'm not even sure what this type of tokenizing is called, so my searches aren't turning up anything useful. Maybe just a nudge in the right direction would help.