I have some texts in different languages and, potentially, with some typo or other mistake, and I want to retrieve their own vocabulary. I'm not experienced with NLP in general, so maybe I use some word improperly.
With vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are are all consider think).
This is the master problem, so let's reduce it to the vocabulary retrieving of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
- search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
- compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
- use a service by any API. Yes, I accept also this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if in the text there were the word thought like both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieve a vocabulary of an English text without mistakes, and without consider the tag of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.