import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import codecs

# Read the corpus: one document per line
with codecs.open("Master_File_for_Docs.txt", encoding='utf-8', mode="r") as fid:
    documents = [line for line in fid]

# A set gives constant-time stopword lookups
stoplist = set(stopwords.words('english'))

# Lowercase, split on whitespace, and remove stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]


dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
lda.print_topics(20)
#corpus_lda = lda[corpus]
#for doc in corpus_lda:
 #   print(doc)

I am running Gensim for topic modeling and trying to get the above code working. I know the code works because a friend ran it successfully on a Mac, but when I run it on a Windows computer it raises a

MemoryError

The logging that I set up on the second line also doesn't appear on my Windows computer. Is there something in Windows that I need to fix in order for gensim to work?

David Yi

2 Answers

I installed gensim on my Windows computer successfully, but I also get a MemoryError when I set a large number of topics on a big dataset. The space complexity of gensim's LDA is O(K*V), where K is the number of topics and V is the size of the dictionary, so how far you can go depends on your computer's RAM. Setting the number of topics to 50, or at least to fewer than 100, may solve it. You may also want to first test the example on the gensim official website: http://radimrehurek.com/gensim/index.html
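To make the O(K*V) estimate concrete, here is a rough back-of-the-envelope sketch. The helper name `topic_word_matrix_bytes`, the 8 bytes per value (float64), and the vocabulary size of 500,000 are all assumptions for illustration. Gensim may hold more than one matrix of this shape and may use a different dtype, so treat the numbers as an order of magnitude only.

```python
def topic_word_matrix_bytes(num_topics, vocab_size, bytes_per_value=8):
    """Rough size in bytes of ONE dense num_topics x vocab_size matrix.

    bytes_per_value=8 assumes float64; this is an assumption, not a
    statement about what gensim allocates internally.
    """
    return num_topics * vocab_size * bytes_per_value


# Hypothetical vocabulary size; use len(dictionary) for the real figure.
vocab_size = 500000

for k in (100, 50):
    mib = topic_word_matrix_bytes(k, vocab_size) / float(2 ** 20)
    print("num_topics=%d: about %.0f MiB per matrix" % (k, mib))
```

Halving the number of topics halves the estimate, which is why lowering num_topics is the first thing to try.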

zack

The MemoryError appears because Gensim is trying to keep all of the data you need in memory while analyzing it. The options are few:

  • Use a server with more memory (an AWS machine, or anything more powerful than your PC)
  • Try a 64-bit Python interpreter
  • Try reducing the size parameter. Note that size is an argument of the Word2Vec constructor, not of model.save(); it gives you fewer features representing each word. For the LdaModel in the question, the comparable setting is num_topics
  • Try increasing the min_count parameter (also a Word2Vec constructor argument, not an argument of model.save()). The model then considers only words that appear at least min_count times. For an LDA pipeline, dictionary.filter_extremes(no_below=...) plays a similar role

Be careful, though: the last two suggestions change the characteristics of your model.
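Two of the suggestions above can be checked or approximated with the standard library alone. The sketch below is only illustrative: `interpreter_bits` and `prune_rare_words` are hypothetical helper names, and the toy documents are made up. It reports whether the running interpreter is 32-bit or 64-bit, and it mimics a minimum-count filter on tokenized texts shaped like the `texts` list in the question.

```python
import struct
from collections import Counter


def interpreter_bits():
    """Pointer width of the running Python interpreter: 32 or 64."""
    return struct.calcsize("P") * 8


def prune_rare_words(texts, min_count):
    """Drop tokens that occur fewer than min_count times across all texts."""
    counts = Counter(word for text in texts for word in text)
    return [[word for word in text if counts[word] >= min_count]
            for text in texts]


print("Running a %d-bit interpreter" % interpreter_bits())

# Toy tokenized documents, made up for illustration.
toy_texts = [["topic", "model", "memory"],
             ["topic", "model"],
             ["topic", "windows"]]
print(prune_rare_words(toy_texts, min_count=2))
```

A 32-bit interpreter can run out of address space well before the machine runs out of physical RAM, which may explain why the same script succeeds on one computer and fails on another. Pruning rare words shrinks the dictionary (the V in the estimate from the other answer), at the cost of changing what the model sees.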

Nicolò Gasparini