I'm getting a CalledProcessError "non-zero exit status 1" error when I run the Gensim LDAMallet model on my full corpus of ~16 million documents. Interestingly enough, if I run the exact same code on a testing corpus of ~160,000 documents the code runs perfectly fine. Since it's working fine on my small corpus I'm inclined to think that the code is fine, but I'm not sure what else would/could cause this error...
I've tried editing the mallet.bat file as suggested here, but to no avail. I've also double checked the paths, but that shouldn't be an issue given that it works with a smaller corpus.
id2word = corpora.Dictionary(lists_of_words)
corpus =[id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME':r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
Here's the full traceback and error:
File "<ipython-input-57-f0e794e174a6>", line 8, in <module>
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
self.train(corpus)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
self.convert_input(corpus, infer=False)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
check_output(args=cmd, shell=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
raise error
CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.