NLP: retrieve vocabulary from text

Question

I have some texts in different languages, potentially with typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so I may be using some terms improperly.

By vocabulary I mean a collection of the words of a single language in which every word is unique and inflections for gender, number, or tense are ignored (e.g. think, thinks and thought all count as think).

This is the general problem, so let's reduce it to retrieving the vocabulary of one language, English for example, from text without mistakes.

I think there are (at least) three different approaches and maybe the solution consists of a combination of them:

1. Search a database in which words are stored in relation to each other. I could look up thought (as a verb) and read the associated information that thought is an inflection of think.
2. Compute the "base form" (the word without inflections) by processing the inflected form. Maybe this can be done with stemming?
3. Use a service through an API. I accept this approach too, but I'd prefer to do it locally.
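
A minimal sketch of the first approach, assuming a tiny hand-written table in place of a real lexical database. The names `INFLECTIONS`, `base_form` and `vocabulary` are hypothetical and only illustrate the idea; the tokenizer handles plain ASCII English only.

```python
import re

# Hypothetical toy table mapping inflected forms to base forms.
# A real system would load this from a lexical database instead.
INFLECTIONS = {
    "thinks": "think",
    "thought": "think",
    "thinking": "think",
    "cats": "cat",
}

def base_form(word):
    """Return the base form if the word is in the table, else the word itself."""
    word = word.lower()
    return INFLECTIONS.get(word, word)

def vocabulary(text):
    """Collect the unique base forms of all alphabetic tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {base_form(token) for token in tokens}

print(sorted(vocabulary("She thinks. I thought so; cats think too.")))
```

Because the result is a set, a word that shows up again in another inflected form does not add a new entry.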

For a first approximation, the algorithm does not need to distinguish between nouns and verbs. For instance, if the word thought appeared in the text as both a noun and a verb, it could be treated as already present in the vocabulary at the second match.

We have reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the part-of-speech tags of the words.

Any ideas about how to do that? Or just some tips?

Of course, if you have suggestions about this problem with the other constraints as well (mistakes and multiple languages, not only Indo-European ones), they would be much appreciated.

@VsevolodDyomkin thanks for your interest. I've found that Wiktionary doesn't have a rigid set of rules for its information. It has some guidelines, but these don't guarantee a defined structure (as said in [Entry layout explained, Flexibility](https://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained#Flexibility)). Do you know of other databases with a strict structure? — Giacomo, Mar 26 '15 at 14:19
Yes, Wiktionary is semi-structured, but you can still extract word forms from Wiktionary definitions (here's some example code showing how you can process them: http://lisp-univ-etc.blogspot.com/2013/06/nltk-21-working-with-text-corpora.html; you can also look into different tools like wiktionary-to-mysql, wiktionary-to-redis or wiktionary-to-dbpedia) — Vsevolod Dyomkin, Mar 26 '15 at 15:16
Looks like it should be http://lisp-univ-etc.blogspot.com/2013/06/nltk-21-working-with-text-corpora.html (I can't tell the difference, but this one works for me?) — tripleee, Mar 26 '15 at 16:18

score 2 · Answer 1 · edited May 23 '17 at 11:55

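You need lemmatization. It's similar to your 2nd item, but not exactly the same: a stemmer strips affixes by rule and may return something that is not a word, while a lemmatizer returns the dictionary form.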

Try the NLTK lemmatizer for Python, or Stanford NLP/Clear NLP for Java. The NLTK lemmatizer actually uses WordNet, so it is really a combination of your 1st and 2nd approaches.

To cope with mistakes, run spelling correction before lemmatization. Take a look at related questions or search for appropriate libraries.
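
As a rough illustration of correcting before lemmatizing, here is a sketch that uses only the standard library's `difflib`. The word list `KNOWN_WORDS` and the function `correct` are hypothetical, and the `0.8` cutoff is an arbitrary assumption. A dedicated spelling library would normally do better than plain string similarity.

```python
import difflib

# Hypothetical list of correctly spelled words; a real setup would use
# a full dictionary for the target language.
KNOWN_WORDS = ["think", "thinks", "thought", "vocabulary", "language"]

def correct(word, cutoff=0.8):
    """Return the closest known word, or the input unchanged if none is close enough."""
    word = word.lower()
    if word in KNOWN_WORDS:
        return word
    matches = difflib.get_close_matches(word, KNOWN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for token in ["thougt", "vocabulry", "xyz"]:
    print(token, "->", correct(token))
```

The corrected token, not the raw one, is then what gets passed to the lemmatizer.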

About part-of-speech tags: unfortunately, the NLTK lemmatizer doesn't consider the POS tag (or context in general) on its own, so you should provide it with the tag, which can be found with NLTK's POS tagging. Again, this is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here. I guessed it would consider context, but I had been sure that NLTK did so too. As far as I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
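
To show why the tag matters, here is a toy sketch in which the lookup is keyed by both the word and its tag. The table `LEMMAS`, the tag names and the function `lemmatize` are hypothetical stand-ins for a real tagger and lemmatizer, not any library's API.

```python
# Hypothetical toy table keyed by (word, part-of-speech tag).
LEMMAS = {
    ("thought", "VERB"): "think",
    ("thought", "NOUN"): "thought",
    ("thoughts", "NOUN"): "thought",
    ("thinks", "VERB"): "think",
}

def lemmatize(word, pos):
    """Look up the lemma for a tagged word, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

# The same surface form gives different lemmas depending on the tag.
print(lemmatize("thought", "VERB"))
print(lemmatize("thought", "NOUN"))
```

Without the tag, a lookup for thought has to pick one of the two readings, which is acceptable only if nouns and verbs need not be distinguished.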

About other languages: search for lemmatization models, since the algorithm for most languages (at least within the same family) is almost the same and the differences are in the training data. Take a look here for an example for German; as far as I can see, it is a wrapper for several lemmatizers.

However, you can always use a stemmer at the cost of precision, and stemmers are more readily available for different languages.
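
The precision cost can be seen with a deliberately crude suffix stripper. This is only a sketch under simplifying assumptions, not the Porter algorithm or any library's stemmer; `SUFFIXES` and `crude_stem` are hypothetical names.

```python
# Suffixes are tried in this order, so longer ones are checked first.
SUFFIXES = ("ing", "es", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for token in ["thinks", "thinking", "thought", "makes"]:
    print(token, "->", crude_stem(token))
```

Regular forms such as thinks and thinking collapse to think, but the irregular thought is left untouched and makes is cut to the non-word mak, which is the kind of error a lemmatizer avoids.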

The lemmatization is a very useful tip. Thank you. — Giacomo, Mar 27 '15 at 16:06

