I have a gensim Word2Vec model that was trained in Python 2 like this:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(LineSentence('enwiki.txt'), size=100,
                     window=5, min_count=5, workers=15)
    model.save('w2v.model')

However, I need to use it in Python 3. If I try to load it,

    import gensim
    from gensim.models import Word2Vec
    model = Word2Vec.load('w2v.model')

it results in an error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)

I suppose the problem lies in encoding differences between Python 2 and Python 3. It also seems that gensim uses pickle to save and load models.

Is there a way to set encoding/pickle options so that the model loads properly? Or maybe use some external tool to convert the model file?

Recomputing it in Python 3 is not an option: it takes way too much time.

DLunin
  • For better Python 2/3 interoperability, an encoding should be specified, as noted [here](http://stackoverflow.com/questions/11305790/pickle-incompatability-of-numpy-arrays-between-python-2-and-3). Since `gensim` can open a file in two ways (via the smart_open library or an alternative method), the full traceback is needed to find a solution or workaround. – memoselyk Nov 09 '15 at 04:19
  • How did you solve it in the end? I tried the answer below, didn't work for me. – Sapling Jan 29 '19 at 13:49

1 Answer

This does look like a bug somewhere, as memoselyk noted, and it can be fixed in the way described in a comment on this answer.

To work around it, add encoding='latin1' to the _pickle.loads call in gensim.utils.unpickle, load the model in Python 3, and save it again. You can then revert the change: the re-saved model loads in unmodified gensim under Python 3.
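To see why the encoding argument matters, here is a minimal standard-library sketch; it does not touch gensim itself. The byte string `PY2_PICKLE` is hand-written to imitate a protocol-2 pickle of the Python 2 str `'\xf9'` and is an assumption for illustration, not data from a real model file. The helper names `load_py2_bytes` and `default_load_fails` are hypothetical as well.

```python
import pickle

# Hand-written bytes imitating a protocol-2 pickle of the Python 2 str '\xf9'.
# This is an illustrative assumption, not content from a real gensim model.
PY2_PICKLE = b'\x80\x02U\x01\xf9q\x00.'


def load_py2_bytes(data, encoding='latin1'):
    """Unpickle Python 2 data, decoding its 8-bit str objects with `encoding`."""
    return pickle.loads(data, encoding=encoding)


def default_load_fails(data):
    """Return True if unpickling with the default ASCII decoding raises."""
    try:
        pickle.loads(data)
    except UnicodeDecodeError:
        return True
    return False


# The default decoding fails on the non-ASCII byte; latin1 maps every byte
# to a code point, so the same data loads.
failed_with_default = default_load_fails(PY2_PICKLE)
decoded = load_py2_bytes(PY2_PICKLE)

# Re-pickling under Python 3 yields data that loads without any encoding hint,
# which mirrors the "load, save again, then revert the change" steps.
resaved = pickle.dumps(decoded)
reloaded = pickle.loads(resaved)
```

The same idea is what the edit to gensim.utils.unpickle relies on: once the object has been loaded with `encoding='latin1'` and saved again from Python 3, the new file no longer contains Python 2 byte strings.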

Community