Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim processes raw, unstructured digital text (“plain text”). Its algorithms, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised: no human input is necessary, only a corpus of plain-text documents.
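To make "word co-occurrence patterns" concrete, here is a minimal pure-Python sketch that counts how often two words appear in the same document. It is only an illustration of the raw statistic, not gensim's API or any of its algorithms; the toy corpus and the helper name `cooccurrence_counts` are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each document is already tokenized.
corpus = [
    ["human", "computer", "interface"],
    ["computer", "system", "interface"],
    ["graph", "tree", "minors"],
]

def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of distinct words."""
    counts = Counter()
    for doc in docs:
        # sorted(set(...)) gives each pair a canonical (alphabetical) order
        # and ignores repeated words within a single document.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

pair_counts = cooccurrence_counts(corpus)
```

Pairs such as `("computer", "interface")` that occur together in several documents get higher counts, which is the kind of signal topic models build on.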

Once these statistical patterns are found, any plain-text document can be expressed succinctly in the new semantic representation and queried for topical similarity against other documents.
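The query step can be sketched roughly as follows. This example swaps in plain bag-of-words counts for the learned semantic representation, so it is a simplified stand-in rather than LSA or LDA, and it does not use gensim; the documents and the helper names `bow`, `cosine`, and `rank_by_similarity` are hypothetical.

```python
import math
from collections import Counter

def bow(tokens):
    """Represent a tokenized document as a sparse word-count vector."""
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_tokens, docs):
    """Return (document index, score) pairs, most similar first."""
    q = bow(query_tokens)
    scores = [(i, cosine(q, bow(d))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical tokenized documents.
docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "response", "time"],
    ["graph", "tree", "minors"],
]

ranking = rank_by_similarity(["computer", "interface"], docs)
```

In a real pipeline the count vectors would first be transformed into the topic space learned from the training corpus; the ranking by cosine similarity works the same way on those vectors.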

2433 questions
145 votes, 14 answers

How to calculate the sentence similarity using word2vec model of gensim with python

According to Gensim Word2Vec, I can use the word2vec model in the gensim package to calculate the similarity between two words, e.g. trained_model.similarity('woman', 'man') gives 0.73723527. However, the word2vec model fails to predict the sentence…
zhfkt • 2,415 • 3 • 21 • 24

68 votes, 10 answers

Convert word2vec bin file to text

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text…
Glenn • 6,455 • 4 • 33 • 42

58 votes, 4 answers

Doc2vec: How to get document vectors

How can I get document vectors for two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial. I am using gensim. doc1=["This is a sentence","This is another…
bee2502 • 1,145 • 1 • 10 • 13

55 votes, 1 answer

gensim Doc2Vec vs tensorflow Doc2Vec

I'm trying to compare my implementation of Doc2Vec (via tf) with gensim's implementation. It seems, at least visually, that the gensim one performs better. I ran the following code to train the gensim model and the one below that for tensorflow…
sachinruk • 9,571 • 12 • 55 • 86

54 votes, 5 answers

gensim word2vec: Find number of words in vocabulary

After training a word2vec model using python gensim, how do you find the number of words in the model's vocabulary?
hlin117 • 20,764 • 31 • 72 • 93

51 votes, 6 answers

PyTorch / Gensim - How do I load pre-trained word embeddings?

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer. How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
MBT • 21,733 • 19 • 84 • 102

49 votes, 18 answers

gensim error: ImportError: No module named 'gensim'

I am trying to import gensim with import gensim, but I get the following error: ImportError Traceback (most recent call last) in () ----> 1 import gensim 2 model =…
woojung • 501 • 1 • 4 • 4

48 votes, 5 answers

How to create a word cloud from a corpus in Python?

In "Creating a subset of words from a corpus in R", the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function in Python libraries that takes either a raw word text file or NLTK corpus or Gensim…
alvas • 115,346 • 109 • 446 • 738

46 votes, 4 answers

How to use Gensim doc2vec with pre-trained word vectors?

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec? Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector…
Stergios • 3,126 • 6 • 33 • 55

45 votes, 4 answers

How to get tfidf with pandas dataframe?

I want to calculate tf-idf from the documents below. I'm using python and pandas. import pandas as pd df = pd.DataFrame({'docId': [1,2,3], 'sent': ['This is the first sentence','This is the second sentence', 'This is the third…
user1610952 • 1,249 • 1 • 16 • 31

45 votes, 8 answers

How to check if a key exists in a word2vec trained model or not

I have trained a word2vec model on a corpus of documents with Gensim. Once the model is trained, I use the following piece of code to get the raw feature vector of a word, say "view": myModel["view"]. However, I get a KeyError for the word…
London guy • 27,522 • 44 • 121 • 179

44 votes, 1 answer

Doc2Vec Get most similar documents

I am trying to build a document retrieval model that returns documents ordered by their relevance to a query or a search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is in the form of a…
Clock Slave • 7,627 • 15 • 68 • 109

43 votes, 1 answer

How to extract phrases from corpus using gensim

To preprocess the corpus, I was planning to extract common phrases from it. For this I tried using the Phrases model in gensim with the code below, but it is not giving me the desired output. My code: from gensim.models import Phrases documents =…
Prashant Puri • 2,324 • 1 • 15 • 21

39 votes, 6 answers

Update gensim word2vec model

I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the model), I need to update the model with that sentence so that querying it next…
user2480542 • 2,845 • 4 • 24 • 25

34 votes, 3 answers

Python Gensim: how to calculate document similarity using the LDA model?

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a…
still_st • 363 • 1 • 3 • 7