
I am trying to load a fastText .bin model in Spanish, downloaded from https://fasttext.cc/docs/en/crawl-vectors.html, and continue training it with new sentences from the specific domain I am interested in.

System: Anaconda, Jupyter Notebook, Python 3.6, Upgraded Gensim

My code (toy example):

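from gensim.models.fasttext import load_facebook_model
import numpy as np  # needed for np.copy and np.allclose below
import os
os.chdir('path/to/directory')
# Load the full model (not only the vectors) so that training can continue
model = load_facebook_model('cc.es.300.bin')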

'enmadrarse' in model.wv.vocab
>>> False
old_vector = np.copy(model.wv['enmadrarse'])

new_sentences = [['complexidad', 'cataratas', 'enmadrarse'],
['enmadrarse', 'cataratas', 'increibles'], 
['unidad','enmadrarse','complexa']]

model.build_vocab(new_sentences, update = True)
model.train(new_sentences, total_examples = len(new_sentences), epochs=model.epochs)

new_vector = np.copy(model.wv['enmadrarse'])
np.allclose(old_vector, new_vector, atol=1e-4)
>>> True

'enmadrarse' in model.wv.vocab
>>> False (still)

The old and new vectors of the word are equal and the word is still out of the vocab, so the model learnt nothing from the new sentences. What am I doing wrong?

  • What is the `model.min_count`? (Even using `update=True` with `build_vocab()`, words that don't appear at least `min_count` times will be ignored.) Also, do the other words in your synthetic texts either already exist, or appear at least `min_count` times in your new texts? (If they're ignored, leaving any of your sentences with just one effective word, no training will occur from such a sentence, as training requires neighboring words.) – gojomo Feb 05 '20 at 19:03
  • Totally right. It worked when I added words that already exist in the vocab and repeated the given term in the new texts until it appeared at least `min_count` times. Thank you – Yanis Kartalis Feb 06 '20 at 13:59
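
For anyone who hits the same thing, here is a rough sketch of the rule described in the comments above. It is plain Python, not gensim's actual code. The helper `effective_sentences`, the `known_vocab` set, and the `min_count` value of 5 are all assumptions made for illustration; check `model.min_count` and the real vocabulary on your own model.

```python
from collections import Counter

def effective_sentences(sentences, known_vocab, min_count):
    """Hypothetical helper: keep only the words that are either already
    known or frequent enough, then drop sentences left with < 2 words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # New words survive only if they appear at least min_count times.
    kept = set(known_vocab) | {w for w, c in counts.items() if c >= min_count}
    result = []
    for sentence in sentences:
        filtered = [w for w in sentence if w in kept]
        # Training needs neighboring words, so one-word sentences are useless.
        if len(filtered) >= 2:
            result.append(filtered)
    return result

new_sentences = [['complexidad', 'cataratas', 'enmadrarse'],
                 ['enmadrarse', 'cataratas', 'increibles'],
                 ['unidad', 'enmadrarse', 'complexa']]

# Assumption: these two words are already in the pretrained vocabulary.
known_vocab = {'cataratas', 'unidad'}
min_count = 5  # assumption; read the real value from model.min_count

# 'enmadrarse' appears only 3 times, so nothing trainable is left.
too_few = effective_sentences(new_sentences, known_vocab, min_count)

# Repeating the corpus pushes 'enmadrarse' to 6 occurrences, and each
# sentence now also contains a known neighbor word.
enough = effective_sentences(new_sentences * 2, known_vocab, min_count)

print(too_few)
print(len(enough))
```

Under these assumptions the original three sentences leave nothing to train on, while the repeated corpus keeps every sentence with 'enmadrarse' next to a known word, which matches what fixed it for the asker.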
