
Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering whether anybody has experience with lemmatizing the corpus before training word2vec, and whether this is a useful preprocessing step to do.
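
For concreteness, the kind of pipeline I have in mind is sketched below (NLTK's WordNetLemmatizer and gensim are just example choices on my part; the vector_size parameter name assumes gensim 4.x, older versions call it size):

    # Minimal sketch: lemmatize a toy corpus with NLTK, then train gensim's Word2Vec on it.
    # Assumes the NLTK data packages 'punkt' and 'wordnet' are already downloaded.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from gensim.models import Word2Vec

    raw_sentences = [
        "The cats were chasing mice in the gardens.",
        "A cat chases a mouse in the garden.",
    ]

    lemmatizer = WordNetLemmatizer()
    lemmatized = [
        [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(sent)]
        for sent in raw_sentences
    ]
    # 'cats' -> 'cat', 'mice' -> 'mouse', 'gardens' -> 'garden' (noun lemmas by default)

    model = Word2Vec(lemmatized, vector_size=50, window=2, min_count=1)
    print(model.wv.most_similar("cat"))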

Jérôme Bau
Luca Fiaschi
  • you mean word2vec from gensim? – alvas May 27 '14 at 15:17
  • yes, but also in general the word2vec algorithm – Luca Fiaschi May 28 '14 at 11:09
  • For instance, Turkish is an agglutinative language (https://en.wikipedia.org/wiki/Agglutination), which makes its morphology very complex to handle. In such cases stemming/lemmatisation is needed in order to slim the corpus down to a reasonably small vocabulary. – XentneX Aug 21 '17 at 21:22

2 Answers


I think it really depends on the task you want to solve with this.

Essentially, by lemmatization you make the input space sparser, which can help if you don't have enough training data.

But since Word2Vec is usually trained on fairly big corpora, lemmatization shouldn't gain you much if you have enough training data.
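
To make that concrete, here is a toy count (a sketch using NLTK's WordNetLemmatizer, nothing specific to Word2Vec) showing how lemmatization collapses several surface forms onto one type, so each type gets more occurrences to learn from:

    # Toy sketch: compare vocabulary sizes before and after lemmatization.
    # Assumes the NLTK 'punkt' and 'wordnet' data packages are available.
    from collections import Counter
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    text = "He runs daily. She ran yesterday. They were running, and he will run again."
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]

    # 'runs', 'ran', 'running' and 'run' are four separate types in the raw tokens,
    # but collapse onto a single 'run' type (with four occurrences) after lemmatization.
    print(len(set(tokens)), Counter(tokens))
    print(len(set(lemmas)), Counter(lemmas))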

Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
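
One way to handle this, sketched below, is to merge known phrases into single tokens before the lookup; the hand-written phrase list and NLTK's MWETokenizer are just one possible choice (gensim's Phrases model can also learn such collocations from the corpus):

    # Sketch: keep multi-word phrases together so they can be looked up as single
    # vectors. The phrase list here is hand-written for illustration.
    from nltk.tokenize import MWETokenizer, word_tokenize

    mwe = MWETokenizer([("New", "York")], separator="_")

    sentence = "Good muffins cost $3.88 in New York."
    tokens = mwe.tokenize(word_tokenize(sentence))
    print(tokens)
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.']
    # A pre-trained model whose vocabulary contains 'New_York' (the Google News
    # vectors use this convention) can then return a single vector for the phrase.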

Prometheus
Daniel
  • >> "Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V (or anything else)" What do you mean by tokenization in this context? Thanks – Luca Fiaschi May 28 '14 at 11:10
  • Like "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases. – Daniel May 28 '14 at 20:16
  • >> "Essentially by lemmatization you make the input space sparser" Did you mean if you keep both the lemmatized and the original form of the tokens? Otherwise, wouldn't lemmatization make the input space much smaller? – samsamara Jan 11 '16 at 12:57
  • Lemmatisation makes data denser, thus reducing the amount of data required for adequate training. – Eli Korvigo Oct 31 '16 at 03:27

The current project I am working on involves identifying gene names in biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:

  • The vocabulary gets far too big, since you have the same word in many different forms that in the end carry the same meaning.
  • As noted above, the space gets less sparse, since you get more representatives of a certain "meaning"; at the same time, the evidence for that meaning gets split among those representatives. Let me clarify with an example.

We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word that has different forms (badly, worst, ...). Since Word2vec builds its vectors based on context (the surrounding words) probabilities, if you don't lemmatize some of these forms you might end up losing the relationship between some of these words. In the BAD case, you might then end up with a word that is closer to gene names than to adjectives in the vector space.
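
One simple mitigation (a sketch of an idea, not a description of our actual pipeline) is a case-aware normalisation step that lowercases ordinary words but leaves all-caps acronyms such as BAD untouched, so the gene token is never merged with the adjective:

    # Sketch: lowercase ordinary tokens but keep all-caps acronyms (length > 1) intact,
    # so the gene acronym 'BAD' stays distinct from the adjective 'bad'.
    def normalise(tokens):
        return [tok if tok.isupper() and len(tok) > 1 else tok.lower() for tok in tokens]

    print(normalise(["The", "BAD", "gene", "interacts", "with", "BCL-2", "proteins"]))
    # ['the', 'BAD', 'gene', 'interacts', 'with', 'BCL-2', 'proteins']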

Roger
  • Make the tokens case sensitive if the acronym is always BAD. – lucid_dreamer Apr 17 '18 at 00:18
  • But then you get a lot of noise for every word that occurs at the beginning of a sentence and therefore starts with a capital letter. Another approach would be to use POS-tagging to identify "bad" either as a noun or as an adjective. – N4ppeL Jul 05 '18 at 14:20
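
A rough sketch of that POS-tagging idea (assuming NLTK's default tagger; the '|' suffix convention is just an illustration): tag each token and append the tag, so the adjective "bad" and the proper noun "BAD" end up as different vocabulary entries.

    # Sketch: append POS tags so 'bad' (adjective) and 'BAD' (proper noun) become
    # distinct vocabulary entries. Assumes NLTK's 'punkt' and
    # 'averaged_perceptron_tagger' data packages are available.
    import nltk

    def tag_tokens(sentence):
        tokens = nltk.word_tokenize(sentence)
        return [f"{tok}|{tag}" for tok, tag in nltk.pos_tag(tokens)]

    print(tag_tokens("The BAD gene promotes apoptosis."))
    # e.g. ['The|DT', 'BAD|NNP', 'gene|NN', 'promotes|VBZ', 'apoptosis|NN', '.|.']
    print(tag_tokens("The results were bad."))
    # e.g. ['The|DT', 'results|NNS', 'were|VBD', 'bad|JJ', '.|.']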