Load pre-trained Spanish word vectors and then retrain them with custom sentences? (Gensim / FastText)

Question

I am trying to load pre-trained Spanish word vectors and then retrain them with custom sentences:

!pip install fasttext
import fasttext
import fasttext.util
# download the pre-trained Spanish word vectors (cc.es.300.bin)
fasttext.util.download_model('es', if_exists='ignore')  # Spanish
ft = fasttext.load_model('cc.es.300.bin')

But as soon as I try to update the vocabulary, I get this AttributeError:

ft.build_vocab(sentences, update=True)
AttributeError: '_FastText' object has no attribute 'build_vocab'

Any advice?

Please read these answers: https://stackoverflow.com/a/64711974/10883094 (and https://stackoverflow.com/a/58342618/10883094). In any case, you must use a syntax like this: `model = fasttext.train_supervised(input=TRAIN_FILEPATH, ..., pretrainedVectors=VECTORS_FILEPATH)` — Stefano Fiorucci - anakin87, Dec 13 '21 at 09:08
Thanks. I was checking those answers, but I need to retrain an unsupervised model: I have a small corpus, so I want to first load a model trained on a Spanish corpus and then retrain it with my small corpus. From what I read in the docs, I can train with '''model = fasttext.train_unsupervised('data.txt', model='skipgram')''' or load a model directly with '''model = fasttext.load_model("model_filename.bin")''', but I can't find how to retrain the fastText model with my own data set. I don't think I should use '''fasttext.train_supervised'''. – mrbangybang, Dec 13 '21 at 17:32

score 0 · Accepted Answer · answered Dec 13 '21 at 15:39


The build_vocab() method is a step in the Gensim library's implementation of the FastText algorithm, not part of the original fasttext package from Facebook that you seem to be loading. (You're mixing code meant for two different libraries.)

If you switch to using Gensim code, rather than Facebook's implementation, you won't get that same error when trying to use build_vocab().
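As a rough illustration of what "switching to Gensim code" could look like, here is a minimal sketch. It assumes Gensim 4.x is installed and that the cc.es.300.bin file from the question is available locally. The helper names (load_pretrained, tokenize, continue_training) are hypothetical and only for illustration, and the plain whitespace tokenization is an assumption; load_facebook_model, build_vocab(..., update=True) and train are Gensim's own calls.

```python
def load_pretrained(path):
    """Load a Facebook-format .bin model into Gensim's FastText class."""
    # Imported here so the other helpers can be used without importing Gensim.
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(path)


def tokenize(lines):
    """Turn raw text lines into lists of lowercase tokens.

    Whitespace splitting is an assumption; use whatever preprocessing
    matches your corpus.
    """
    return [line.lower().split() for line in lines if line.strip()]


def continue_training(model, sentences, epochs=5):
    """Extend the model's vocabulary and train further on new sentences.

    `sentences` must be a list of token lists. The epoch count is an
    arbitrary illustrative value, not a recommendation.
    """
    # update=True adds to the existing vocabulary instead of rebuilding it.
    model.build_vocab(sentences, update=True)
    model.train(sentences, total_examples=len(sentences), epochs=epochs)
    return model


# Hypothetical usage (needs Gensim and the downloaded model file):
#   model = load_pretrained("cc.es.300.bin")
#   sentences = tokenize(open("my_corpus.txt", encoding="utf-8"))
#   continue_training(model, sentences)
#   print(model.wv.most_similar("ejemplo"))
```

This only shows that the calls exist and in what order they are used; whether the resulting vectors are any better than the original ones has to be checked separately.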

Note, though, that what you're attempting, incremental retraining of an existing model, is an advanced/experimental technique that can easily backfire. So it's usually a bad idea to attempt without expertise & rigorous checks as to whether the extra complications are helping.

answered Dec 13 '21 at 15:39

gojomo


Thanks for your comment. I know it is not an easy task for a newbie, but I really need to do it: I am working with a small corpus, so I want to first load a model trained on a Spanish corpus and then retrain it with my small corpus. Do you know of a tutorial or article that could help? – mrbangybang Dec 13 '21 at 17:36

I don't know of any good tutorials for such incremental tuning of a FastText model - most attempts I see are seriously misguided. Why do you think an undocumented, difficult process is needed in your situation? (Are you sure the generic vectors don't work well enough? Are you sure you can't just train new vectors, by assembling an adequately-sized & representative corpus yourself?) – gojomo Dec 13 '21 at 19:31
