Using a Word2Vec model pre-trained on Wikipedia

Question

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model that's pre-trained on the English Wikipedia corpus. Does anyone know where to download one, how to install it, and how to use gensim to create the vectors?

Have you seen this [page](https://github.com/idio/wiki2vec/) before? — imanzabet, Jul 26 '17 at 08:41
This [link](https://www.quora.com/Where-can-I-find-pre-trained-models-of-word2vec-and-sentence2vec-trained-on-Wikipedia-or-other-large-text-corpus) also might be helpful — imanzabet, Jul 26 '17 at 08:42

formi23 · Answer 1 · 2017-12-14T04:26:24.853

You can check WebVectors to find Word2Vec models trained on various corpora. Each model comes with a readme covering the training details. You'll have to be a bit careful using these models, though. I'm not sure about all of them, but at least in Wikipedia's case the model is not a binary file but a txt version, i.e. a file with words and their corresponding vectors, so you need to load it with binary=False. Keep in mind, too, that each word is suffixed with its part-of-speech (POS) tag. For example, if you'd like to use the model to find similar words for vacation, you'll get a KeyError if you type vacation as is, since the model stores this word as vacation_NOUN. An example snippet showing how you could use the wiki model (and perhaps others, if they're in the same format), along with its output, is below:

import gensim.models

model = "./WebVectors/3/enwiki_5_ner.txt"

word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model, binary=False)
print(word_vectors.most_similar("vacation_NOUN"))
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))

and the output

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]
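Since the KeyError on untagged words is easy to trip over, here is a minimal sketch of a helper that finds which tagged forms of a word actually exist in the model. The function name and the list of tags are my own assumptions, not part of gensim or WebVectors, so check the model's readme for the tag set it really uses. It relies only on the `in` operator, which gensim's KeyedVectors supports, so a plain dict stands in for the model here.

```python
# Universal POS tags to try, in order. The tag set of a given WebVectors
# model is an assumption here; consult the readme shipped with the model.
CANDIDATE_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PROPN")


def tagged_keys(word, vectors, tags=CANDIDATE_TAGS):
    """Return every 'word_TAG' key that is present in `vectors`.

    `vectors` only needs to support the `in` operator, which both a plain
    dict and gensim's KeyedVectors do.
    """
    candidates = (f"{word}_{tag}" for tag in tags)
    return [key for key in candidates if key in vectors]


# Stand-in for a loaded model: a dict with made-up two-dimensional vectors.
toy_vectors = {
    "vacation_NOUN": [0.1, 0.3],
    "vacation_VERB": [0.2, 0.1],
    "king_NOUN": [0.9, 0.4],
}

print(tagged_keys("vacation", toy_vectors))  # ['vacation_NOUN', 'vacation_VERB']
print(tagged_keys("holiday", toy_vectors))   # []
```

With a real model you would pass `word_vectors` instead of the dict and feed whichever key comes back into `most_similar`.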

UPDATE Here are some useful links to binary models:

Pretrained word embedding models:

Fasttext models:

crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).
wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
Wiki word vectors, dim=300: wiki.en.zip: bin+text model

Google Word2Vec

Pretrained word/phrase vectors:
- GoogleNews-vectors-negative300.bin.gz
- GoogleNews-vectors-negative300-SLIM.bin.gz: slim version with approx. 300k words
Pretrained entity vectors:
- freebase-vectors-skipgram1000.bin.gz: Entity vectors trained on 100B words from various news articles
- freebase-vectors-skipgram1000-en.bin.gz: Entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency

GloVe: Global Vectors for Word Representation

glove.6B.zip: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download). Here's an example in action.
glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

WebVectors

models trained on various corpora, augmented by Part-of-Speech (POS) tags
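One practical note on the GloVe files listed above: they are plain text like the word2vec txt format, but they lack the first header line ("vocabulary size, dimension") that `load_word2vec_format` expects. A rough sketch of adding that header with the standard library only is below; the function name is hypothetical and the file paths are whatever you downloaded.

```python
def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file, prepending the word2vec header line.

    Returns a (vocabulary size, vector dimension) tuple.
    """
    # First pass: count the vectors and read the dimension off the first line.
    count, dim = 0, 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # Each line is "<word> <v1> <v2> ... <vN>".
                dim = len(line.rstrip().split(" ")) - 1
            count += 1

    # Second pass: write the header, then stream the vectors through unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            dst.write(line)
    return count, dim
```

The converted file should then load with `KeyedVectors.load_word2vec_format(output_path, binary=False)`. If you are on gensim 4.0 or later, passing `no_header=True` to `load_word2vec_format` reads the original GloVe file directly, so the conversion is only needed on older versions.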

You can use `pickle.dump(model, file)` with a gensim model loaded by `load_word2vec_format()`. Then, `model = pickle.load(file)` works much faster than parsing .vec files each time. — Rabash, Apr 18 '18 at 14:36
Who trained GoogleNews-vectors-negative300.bin.gz? I need to cite it for my thesis — Marco, Aug 10 '19 at 09:36
Thank you for providing a concise list of so many pre trained models! — Moltres, Mar 14 '20 at 15:40
I am currently wondering whether the code example from https://fasttext.cc/docs/en/crawl-vectors.html results in a model trained on only Common Crawl or on Common Crawl + Wikipedia. Does anybody know? ft = fasttext.load_model('cc.en.300.bin') looks like CommonCrawl only but on top of the site it says "We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText." — monart, Aug 13 '20 at 08:11
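To make Rabash's pickle tip above concrete, here is a small sketch of a load-once cache. `load_cached` and the file names in the usage comment are hypothetical; the loader is passed in as a callable so that the caching logic itself does not depend on gensim.

```python
import os
import pickle


def load_cached(source_path, cache_path, loader):
    """Return loader(source_path), reusing a pickle at cache_path if present."""
    if os.path.exists(cache_path):
        # Fast path: the slow text parse was already done on an earlier run.
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)

    # Slow path: parse the original file once, then store the result.
    obj = loader(source_path)
    with open(cache_path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return obj


# Usage with gensim (assumes gensim is installed and the .vec file exists):
# from gensim.models import KeyedVectors
# wv = load_cached(
#     "wiki-news-300d-1M.vec",
#     "wiki-news-300d-1M.pkl",
#     lambda path: KeyedVectors.load_word2vec_format(path, binary=False),
# )
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code. gensim's own `save()` and `KeyedVectors.load(..., mmap="r")` are another way to get the same speed-up.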

score 2 · Accepted Answer · answered Jul 26 '17 at 15:47

@imanzabet provided useful links with pre-trained vectors, but if you want to train the models yourself using gensim then you need to do two things:

Acquire the Wikipedia data, which you can access here. Looks like the most recent snapshot of English Wikipedia was on the 20th, and it can be found here. I believe the other English-language "wikis" e.g. quotes are captured separately, so if you want to include them you'll need to download those as well.
Load the data and use it to generate the models. That's a fairly broad question, so I'll just link you to the excellent gensim documentation and word2vec tutorial.

Finally, I'll point out that there seems to be a blog post describing precisely your use case.
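For the second step, a very rough sketch of what training could look like is below. It assumes you have already extracted the dump to a plain-text file with one sentence or article per line; that file, the class name, and the hyperparameter values are illustrative assumptions, not recommendations from the gensim docs.

```python
class LineSentences:
    """Restartable iterable yielding one list of lowercase tokens per line.

    Word2Vec passes over the corpus several times, so a one-shot generator
    would not work; this object reopens the file on every iteration.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


def train_word2vec(corpus_path, output_path, dim=200):
    """Train a model and save its vectors in word2vec binary format."""
    # Imported here so the iterator above is usable without gensim installed.
    from gensim.models import Word2Vec

    # gensim 4.x names the dimension `vector_size`; gensim 3.x called it `size`.
    model = Word2Vec(
        sentences=LineSentences(corpus_path),
        vector_size=dim,
        window=5,
        min_count=5,
        workers=4,
    )
    model.wv.save_word2vec_format(output_path, binary=True)
    return model.wv
```

gensim ships similar pieces itself: `gensim.models.word2vec.LineSentence` for line-per-sentence files and `gensim.corpora.WikiCorpus` for reading the compressed dump directly, so in practice you may not need the hand-rolled iterator at all.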

Suriname0

Unfortunately the code on the GitHub page imanzabet posted is out of date, and although I've gone over the gensim documentation I don't know where to start in terms of setting up a pre-trained model. – Boris Jul 26 '17 at 16:24
Did you try [`load_word2vec_format`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format) as specified in the docs? – Suriname0 Jul 27 '17 at 04:17
See also: http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/ – Suriname0 Jul 27 '17 at 04:20
I did the following after downloading the English Wikipedia model as a .bin file: – Boris Jul 27 '17 at 15:58
`model = models.KeyedVectors.load_word2vec_format('english.bin.gz', binary=True)`. The similarity query works, but when I try `model["word"]` I get a giant array of values in 3 columns. – Boris Jul 27 '17 at 16:04
That doesn't *sound* wrong; the word vectors should be returned from the KeyedVectors when you subscript a word, something like `array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)`. What are the dimensions of the array being returned? The "3 columns" may just be a display artifact. – Suriname0 Jul 28 '17 at 17:19
That's good to hear. The .size and .shape attributes both return 200. Seeing as I'm new to this, could you explain why some of the values are negative, why there are so many, and what they represent? Would be much appreciated. – Boris Jul 29 '17 at 11:57
Sure, although the most helpful thing would probably be just more reading about distributed word embeddings. (I like [this PDF slide deck](http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf).) Basically, the gist is to create fixed-length (in this case 1 x 200) vectors that place (or embed) "similar" words close together in vector space and "dissimilar" words far away from each other. This is nice because (1) you can compute cosine similarity between vectors very easily and (2) the fixed-length vectors are great as input to machine learning algorithms, since they have far fewer dimensions than one-hot encodings and capture contextual information. – Suriname0 Jul 29 '17 at 14:21
So to answer the actual question, the answer is "any particular dimension within the vector is essentially uninterpretable" and "positive and negative values are arbitrary; you could shift and rescale all the values to the 0-1 range and the relative distances between vectors would be preserved (cosine similarities, which are measured from the origin, would change, though)". – Suriname0 Jul 29 '17 at 14:23
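To illustrate the cosine similarity point from the comments above, here is a small standard-library sketch. The three-dimensional vectors are made up for the example; a real model like the one discussed here would return 200 values per word.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flip_second(v):
    """Negate the second dimension of a vector."""
    return [v[0], -v[1], v[2]]


# Made-up vectors standing in for what model["king"] etc. would return.
king = [0.8, -0.3, 0.5]
queen = [0.7, -0.2, 0.6]
banana = [-0.4, 0.9, 0.1]

print(cosine_similarity(king, queen))   # close to 1: similar direction
print(cosine_similarity(king, banana))  # negative: pointing apart

# Flipping the sign of one dimension in every vector changes no similarity,
# which is why the sign of an individual value carries no meaning by itself.
print(cosine_similarity(flip_second(king), flip_second(queen)))
```

The first and last printed values match up to floating-point rounding, which is the sense in which an individual coordinate, positive or negative, is not interpretable on its own.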