11

I trained an unsupervised model using the fasttext.train_unsupervised() function in Python. I want to save it as a vec file, because I will pass that file to the pretrainedVectors parameter of fasttext.train_supervised(). pretrainedVectors only accepts a vec file, but I am having trouble creating one. Can someone help me?

P.S. I am able to save the model in bin format. It would also be helpful if you could suggest a way to convert a bin file to a vec file.

Dima Lituiev
esin ildiz

2 Answers

15

To obtain a VEC file containing only the word vectors, I took inspiration from the official bin_to_vec example.

from fasttext import load_model

# Load the original BIN model (replace the placeholder with your own path)
f = load_model("YOUR-BIN-MODEL-PATH")

# Get all words in the model's vocabulary
words = f.get_words()

# Open with "w" so the file is rewritten on every run; UTF-8 avoids encoding errors
with open("YOUR-VEC-FILE-PATH", "w", encoding="utf-8") as file_out:

    # The first line must contain the total number of words and the vector dimension
    file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

    # Write one line per word: the word followed by its vector components.
    # Every word must be written, otherwise the count in the first line is wrong.
    for w in words:
        v = f.get_word_vector(w)
        vstr = ""
        for vi in v:
            vstr += " " + str(vi)
        file_out.write(w + vstr + "\n")

The resulting VEC file can be large. To reduce its size, you can change how the vector components are formatted.

If you want to keep only 4 decimal places, you can replace vstr += " " + str(vi) with
vstr += " " + "{:.4f}".format(vi)
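
The per-component loop can also be collapsed into a single join. Here is a minimal sketch, assuming a hypothetical helper named format_vec_line (it is not part of fastText) that accepts a word and any iterable of floats:

```python
def format_vec_line(word, vector, precision=4):
    """Return one VEC-file line: the word followed by its rounded components.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    fmt = "{:." + str(precision) + "f}"
    return word + " " + " ".join(fmt.format(vi) for vi in vector)


# Example with a made-up 3-dimensional vector
print(format_vec_line("hello", [0.123456, -1.5, 2.0]))
```

Inside the loop above, this would be used as file_out.write(format_vec_line(w, f.get_word_vector(w)) + "\n"). Fewer decimal places make the file smaller but also lose some precision in the vectors.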

tonywang
  • ValueError: Dimension of pretrained vectors (7598805550878845300) does not match dimension (300)! Unfortunately, I get this error when I try to use a vec file created this way. It seems the file doesn't keep the dimension of the word vectors, which should be 300. – esin ildiz Dec 24 '19 at 17:30
  • I received a similar error: "ValueError: Dimension of pretrained vectors (0) does not match dimension (100)!" I fixed the problem by writing the output of this code: str(len(words)) + " " + str(f.get_dimension()) as the first line of the file, as suggested by @darwin007 – dshefman Feb 21 '20 at 00:04
  • I would use file mode "a" with extreme caution. In fact, there is no benefit to using "a" after the last change to the answer. If you run the code more than once, you will append the word count, the dimension, and all the words and vectors again each time. Using "w" instead of "a" rewrites the file on every run, which is probably what you want. Full line: with open(YOUR-VEC-FILE-PATH,'w') as file_out: – dshefman Feb 22 '20 at 15:33
1

You should add the number of words and the vector dimension as the first line of your vec file, then use the -pretrainedVectors parameter.
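
Before training, it can help to confirm that the first line of the vec file really has this form. The following is a rough sketch, assuming a hypothetical helper named read_vec_header (it is not part of fastText) and a UTF-8 encoded file:

```python
def read_vec_header(path):
    """Return (word_count, dimension) parsed from the first line of a VEC file.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    with open(path, "r", encoding="utf-8") as vec_file:
        fields = vec_file.readline().split()
    if len(fields) != 2:
        raise ValueError("first line must be '<word count> <dimension>'")
    # int() raises ValueError if either field is not a whole number
    return int(fields[0]), int(fields[1])
```

The returned dimension should match the dim value used for supervised training, for example fasttext.train_supervised(input="train.txt", pretrainedVectors="model.vec", dim=dim), where both file names are placeholders.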

darwin007