
A Jupyter notebook cannot import dirichlet_likelihood.py from lda2vec, even though this .py file exists in the current lda2vec repository on GitHub.

I installed the module and opened the notebook, then attempted to run it. I suspect there is a very simple reason for my problem.

The notebook is https://github.com/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

When I try the same import at the Python command line (in the same environment), I do not get the error below; instead it wanted Keras, which I installed. After that, the command line says it cannot import preprocess.

uname -a
Linux ubuntu 4.18.0-15-generic #16~18.04.1-Ubuntu SMP Thu Feb 7 14:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
sudo apt-get install python3-venv
python3.6 -m venv .env  
source .env/bin/activate
pip install --upgrade pip
pip install jupyter
pip install lda2vec
from lda2vec import preprocess, Corpus
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-2b87256bea6b> in <module>
----> 1 from lda2vec import preprocess, Corpus
      2 import matplotlib.pyplot as plt
      3 import numpy as np
      4 get_ipython().run_line_magic('matplotlib', 'inline')
      5 

~/.env/lib/python3.6/site-packages/lda2vec/__init__.py in <module>
----> 1 import lda2vec.dirichlet_likelihood as dirichlet_likelihood
      2 import lda2vec.embedding_mixture as embedding_mixture
      3 from lda2vec.Lda2vec import Lda2vec as model
      4 import lda2vec.word_embedding as word_embedding
      5 import lda2vec.nlppipe as nlppipe

AttributeError: module 'lda2vec' has no attribute 'dirichlet_likelihood'
python
from lda2vec import preprocess, Corpus
Using TensorFlow backend.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'preprocess'
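
For anyone hitting the same wall, a small diagnostic of my own (just a sketch, not part of the notebook) is to ask Python which lda2vec it is actually loading and what that package exposes:

# Diagnostic sketch: find out which lda2vec Python will load and, if it
# imports at all, which names it exposes.
import importlib.util

spec = importlib.util.find_spec('lda2vec')
# If this path points at the project directory rather than
# .env/lib/python3.6/site-packages, a local file or folder named lda2vec
# is shadowing the installed package.
print(spec.origin if spec else 'lda2vec not found on sys.path')

try:
    import lda2vec
    print(sorted(n for n in dir(lda2vec) if not n.startswith('_')))
except Exception as exc:  # the package itself may fail to import, as above
    print('import failed: %s' % exc)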

EDIT

I got it to work by doing these things:

  • Get ubuntu-18.04.2-desktop-amd64.iso
  • Have the BIOS virtualisation settings changed for Hyper-V
  • Make a VM in VMware
  • Increase memory to 3 GB
  • Give it 40 GB of disk

Then in a terminal

sudo apt install python2.7
sudo apt install python-pip
pip install virtualenv
mkdir 2.7env
cd 2.7env
python2.7 -m virtualenv .env   # python2.7 has no venv module, so use virtualenv
source .env/bin/activate
pip install --upgrade pip
pip install jupyter
pip install -U spacy
python -m spacy download en
pip install wheel nltk gensim pyLDAvis lda2vec
sudo apt install git
git clone https://github.com/cemoody/lda2vec.git
cp ~/lda2vec/build/lib.linux-x86_64-2.7/lda2vec/corpus.py ~/2.7env/.env/lib/python2.7/site-packages/lda2vec/Corpus.py
cp ~/lda2vec/build/lib.linux-x86_64-2.7/lda2vec/preprocess.py ~/2.7env/.env/lib/python2.7/site-packages/lda2vec/preprocess.py
python -m pip install ipykernel
python -m ipykernel install --user
python lda2vec/examples/twenty_newsgroups/lda2vec/lda2vec_run.py
cd lda2vec/examples/twenty_newsgroups/lda2vec/
jupyter notebook
Change the notebook kernel to the Python 2 kernel.

In Firefox, open lda2vec.ipynb
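
To confirm that the copied files are actually the ones being imported, a quick check of my own (run at a Python prompt inside the activated .env) is:

# Quick check inside the activated .env: both paths should point into
# ~/2.7env/.env/lib/python2.7/site-packages/lda2vec/
import lda2vec
from lda2vec import preprocess

print(lda2vec.__file__)
print(preprocess.__file__)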

As above, I am now stuck trying to get it to re-create the twenty_newsgroups npz file so that I can eventually feed in my own content. In case anyone out there understands this better: I suspect running this script on a VM with little RAM might be the problem, but the error is reported as

(.env) craig@ubuntu:~/whcjimmy/lda2vec/examples/twenty_newsgroups/data$ python preprocess.py 
Traceback (most recent call last):
  File "preprocess.py", line 31, in <module>
    n_threads=4)
  File "/home/craig/whcjimmy/.env/lib/python3.6/site-packages/lda2vec-0.1-py3.6.egg/lda2vec/preprocess.py", line 104, in tokenize
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
  File "/home/craig/whcjimmy/.env/lib/python3.6/site-packages/lda2vec-0.1-py3.6.egg/lda2vec/preprocess.py", line 104, in <dictcomp>
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
  File "vocab.pyx", line 242, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 44, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 157, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 138, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '9243420536193520'. This usually refers to an issue with the `Vocab` or `StringStore`."
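
To narrow down whether this is a spacy problem rather than a RAM problem, the failing lookup can be reproduced outside lda2vec with a few lines (my own sketch, mirroring the line in the traceback):

# Sketch: repeat the kind of Vocab lookup that lda2vec's tokenize does,
# on a tiny document, to see whether spacy's Vocab/StringStore is consistent.
import spacy

nlp = spacy.load('en')  # the model installed with `python -m spacy download en`
doc = nlp(u'A tiny test document about the twenty newsgroups corpus.')

for token in doc:
    # token.orth is the hash id; a mismatch between spacy and its model
    # would surface here as the same E018 KeyError.
    print('%s -> %s' % (token.orth, nlp.vocab[token.orth].lower_))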
TDG
  • Did you create a file with the name `lda2vec.py` or a folder `lda2vec.py`? If you have one, then `import` loads this file (or folder) instead of the module `lda2vec`, and it can't find `preprocess` in your file/folder. Remove `lda2vec.py` or rename it. – furas Sep 11 '19 at 05:55
  • The issue is that `~/.env/lib/python3.6/site-packages/lda2vec/dirichlet_likelihood.py` exists but the `__init__.py` line `import lda2vec.dirichlet_likelihood as dirichlet_likelihood` causes the error `module 'lda2vec' has no attribute 'dirichlet_likelihood'` – TDG Sep 13 '19 at 03:01
  • Maybe it needs some other module(s) to run. I installed `lda2vec` a few minutes ago and I also had to install `pyLDAvis` to import `lda2vec`. Linux Mint 19.2, Python 3.7.4 – furas Sep 13 '19 at 03:15
  • I got around this - it turned out that dirichlet_likelihood.py was simply missing and not copied from the git repo to the right place. If you follow the edited instructions above it works under 2.7. – TDG Sep 17 '19 at 07:03

1 Answer


Alright, I got this to work. The problems were:

  1. Pick the right Python to run a four-year-old git project: Python 2.7.
  2. Check that the installed module has the code from the git repo (see the sketch after this list).
  3. Work through the problem in a Python terminal.
  4. Copy and edit the Python files, then go back to step 3.
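
For step 2, a rough way to see whether the pip-installed package matches the repo (a sketch of my own, using the paths from the EDIT above) is to diff the two directories:

# Sketch for step 2: compare the pip-installed lda2vec with the copy built
# from the git checkout, using the paths from the EDIT above.
import filecmp
import os

installed = os.path.expanduser('~/2.7env/.env/lib/python2.7/site-packages/lda2vec')
from_repo = os.path.expanduser('~/lda2vec/build/lib.linux-x86_64-2.7/lda2vec')

cmp = filecmp.dircmp(installed, from_repo)
print('only in installed package: %s' % cmp.left_only)
print('only in git checkout:      %s' % cmp.right_only)
print('files that differ:         %s' % cmp.diff_files)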

One of the problems above was due to a changed API in a dependency: `ImportError: No module named 'spacy.en'`. The original problem was likely due to something about git or Python that I am not familiar with. The git project's self-tests still all fail and its build fails, but I have my Jupyter notebook running and producing convincing output.
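
For reference, the spacy.en failure comes from an old-style import that newer spacy no longer provides; the change needed is roughly the following (a sketch, since the exact lines depend on the file being edited):

# Old import used by the 4-year-old code; newer spacy raises
# "ImportError: No module named 'spacy.en'" for it:
#   from spacy.en import English
#   nlp = English()

# Newer spacy loads the downloaded English model by name instead:
import spacy
nlp = spacy.load('en')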

TDG
  • Hi, thanks for sharing this. Can you also share the notebook with the running code? Lots of us would be very grateful for it. Thanks! – Uther Pendragon Sep 16 '19 at 00:07
  • Hi there, the 'running' notebook is using a pre-generated npz file. The issue I am having is that I can't get it to run preprocess.py for the twenty_newsgroups example and eat up some new content. The latest is that the https://github.com/whcjimmy/lda2vec fork works, including the notebook, but spacy breaks with [E018] Can't retrieve string for hash '9243420536193520'. This usually refers to an issue with the `Vocab` or `StringStore` – TDG Sep 17 '19 at 05:56