
Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering whether anybody has experience with lemmatizing the corpus before training word2vec, and whether this is a useful preprocessing step to do.
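
For concreteness, the kind of pipeline I have in mind is sketched below (NLTK's WordNetLemmatizer and gensim are just example choices on my part; the vector_size parameter name assumes gensim 4.x, older versions call it size):

    # Minimal sketch: lemmatize a toy corpus with NLTK, then train gensim's Word2Vec on it.
    # Assumes the NLTK data packages 'punkt' and 'wordnet' are already downloaded.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from gensim.models import Word2Vec

    raw_sentences = [
        "The cats were chasing mice in the gardens.",
        "A cat chases a mouse in the garden.",
    ]

    lemmatizer = WordNetLemmatizer()
    lemmatized = [
        [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(sent)]
        for sent in raw_sentences
    ]
    # 'cats' -> 'cat', 'mice' -> 'mouse', 'gardens' -> 'garden' (noun lemmas by default)

    model = Word2Vec(lemmatized, vector_size=50, window=2, min_count=1)
    print(model.wv.most_similar("cat"))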

Jérôme Bau
Luca Fiaschi
  • you mean word2vec from gensim? – alvas May 27 '14 at 15:17
  • yes, but also in general the word2vec algorithm – Luca Fiaschi May 28 '14 at 11:09
  • For instance, Turkish is an agglutinative language (https://en.wikipedia.org/wiki/Agglutination), which makes its morphology very complex to handle. In such cases stemming/lemmatisation is needed in order to slim the corpus down to a reasonably small vocabulary. – XentneX Aug 21 '17 at 21:22

2 Answers


I think it really depends on the task you want to solve with this.

Essentially, by lemmatization you make the input space sparser, which can help if you don't have enough training data.

But since Word2Vec is usually trained on fairly big corpora, lemmatization shouldn't gain you much if you have enough training data.
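
To make that concrete, here is a toy count (a sketch using NLTK's WordNetLemmatizer, nothing specific to Word2Vec) showing how lemmatization collapses several surface forms onto one type, so each type gets more occurrences to learn from:

    # Toy sketch: compare vocabulary sizes before and after lemmatization.
    # Assumes the NLTK 'punkt' and 'wordnet' data packages are available.
    from collections import Counter
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    text = "He runs daily. She ran yesterday. They were running, and he will run again."
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]

    # 'runs', 'ran', 'running' and 'run' are four separate types in the raw tokens,
    # but collapse onto a single 'run' type (with four occurrences) after lemmatization.
    print(len(set(tokens)), Counter(tokens))
    print(len(set(lemmas)), Counter(lemmas))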

Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
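
One way to handle this, sketched below, is to merge known phrases into single tokens before the lookup; the hand-written phrase list and NLTK's MWETokenizer are just one possible choice (gensim's Phrases model can also learn such collocations from the corpus):

    # Sketch: keep multi-word phrases together so they can be looked up as single
    # vectors. The phrase list here is hand-written for illustration.
    from nltk.tokenize import MWETokenizer, word_tokenize

    mwe = MWETokenizer([("New", "York")], separator="_")

    sentence = "Good muffins cost $3.88 in New York."
    tokens = mwe.tokenize(word_tokenize(sentence))
    print(tokens)
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.']
    # A pre-trained model whose vocabulary contains 'New_York' (the Google News
    # vectors use this convention) can then return a single vector for the phrase.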

Prometheus
Daniel
  • >> "Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V (or anything else)" What do you mean by tokenization in this context? Thanks – Luca Fiaschi May 28 '14 at 11:10
  • Like "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases. – Daniel May 28 '14 at 20:16
  • >> "Essentially by lemmatization you make the input space sparser" Did you mean if you keep both the lemmatized and the original form of the tokens? Otherwise, wouldn't lemmatization make the input space much smaller? – samsamara Jan 11 '16 at 12:57
  • Lemmatisation makes data denser, thus reducing the amount of data required for adequate training. – Eli Korvigo Oct 31 '16 at 03:27

The current project I am working on involves identifying gene names in biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:

  • The vocabulary gets far too big, since you have the same word in many different forms that in the end carry the same meaning.
  • As noted above, the space gets less sparse, since you get more representatives of a certain "meaning"; at the same time, the evidence for that meaning gets split among those representatives. Let me clarify with an example.

We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word that has different forms (badly, worst, ...). Since Word2vec builds its vectors based on context (the surrounding words) probabilities, if you don't lemmatize some of these forms you might end up losing the relationship between some of these words. In the BAD case, you might then end up with a word that is closer to gene names than to adjectives in the vector space.
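
One simple mitigation (a sketch of an idea, not a description of our actual pipeline) is a case-aware normalisation step that lowercases ordinary words but leaves all-caps acronyms such as BAD untouched, so the gene token is never merged with the adjective:

    # Sketch: lowercase ordinary tokens but keep all-caps acronyms (length > 1) intact,
    # so the gene acronym 'BAD' stays distinct from the adjective 'bad'.
    def normalise(tokens):
        return [tok if tok.isupper() and len(tok) > 1 else tok.lower() for tok in tokens]

    print(normalise(["The", "BAD", "gene", "interacts", "with", "BCL-2", "proteins"]))
    # ['the', 'BAD', 'gene', 'interacts', 'with', 'BCL-2', 'proteins']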

Roger
  • Make the tokens case sensitive if the acronym is always BAD. – lucid_dreamer Apr 17 '18 at 00:18
  • But then you get a lot of noise for every word that occurs at the beginning of a sentence and therefore starts with a capital letter. Another approach would be to use POS-tagging to identify "bad" either as a noun or as an adjective. – N4ppeL Jul 05 '18 at 14:20
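
A rough sketch of that POS-tagging idea (assuming NLTK's default tagger; the '|' suffix convention is just an illustration): tag each token and append the tag, so the adjective "bad" and the proper noun "BAD" end up as different vocabulary entries.

    # Sketch: append POS tags so 'bad' (adjective) and 'BAD' (proper noun) become
    # distinct vocabulary entries. Assumes NLTK's 'punkt' and
    # 'averaged_perceptron_tagger' data packages are available.
    import nltk

    def tag_tokens(sentence):
        tokens = nltk.word_tokenize(sentence)
        return [f"{tok}|{tag}" for tok, tag in nltk.pos_tag(tokens)]

    print(tag_tokens("The BAD gene promotes apoptosis."))
    # e.g. ['The|DT', 'BAD|NNP', 'gene|NN', 'promotes|VBZ', 'apoptosis|NN', '.|.']
    print(tag_tokens("The results were bad."))
    # e.g. ['The|DT', 'results|NNS', 'were|VBD', 'bad|JJ', '.|.']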