
I wanted to understand how fastText creates vectors for sentences. According to issue #309, sentence vectors are obtained by averaging the word vectors.

In order to confirm this, I wrote the following script:

import numpy as np
import fastText as ft

# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')

# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2

# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)

# Printing the result.
print(is_equal)

# Output: False

However, it seems that the obtained vectors are not identical.

Why aren't the two values the same? Could it be related to the way I am averaging the vectors, or is there something I am missing?

Nordle
ryuzakinho

3 Answers


First, you missed that get_sentence_vector is not just a simple average. Before fastText sums the word vectors, each vector is divided by its L2 norm, and the averaging only includes vectors whose L2 norm is positive.

Second, a sentence always ends with an EOS token, so if you calculate the average manually you need to include the EOS vector as well.

Try this (I assume the L2 norm of each word vector is positive):


# Note: this reuses `np` and `model` from the script in the question.

def l2_norm(x):
    # Euclidean (L2) norm of a vector.
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    # Divide a vector by its L2 norm; zero vectors are returned unchanged.
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x

# Getting word vectors for 'one' and 'two', plus the EOS token.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
eos = model.get_word_vector('\n')

# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
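
To compare the manual average with fastText's own result, a tolerance-based check is more reliable than np.array_equal, since floating-point summation order can introduce tiny differences. A minimal sketch, reusing np, one_two, and one_two_avg from above:

# allclose tolerates small rounding differences that exact equality would reject.
print(np.allclose(one_two, one_two_avg, atol=1e-6))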

You can see the source code here, or follow the discussion here.

malioboro

Even though it is an old question, fastText is a good starting point for understanding sentence vectors built by averaging individual word vectors, and for exploring the simplicity, advantages, and shortcomings of that approach before trying alternatives such as SIF, SentenceBERT embeddings, or (with an API key, if you have one) the OpenAI embeddings. I would like to point out that the use of EOS, as mentioned by @malioboro in one of the answers, is not correct. This can be checked with the code below:

import numpy as np
import fasttext.util

# Download and load the English vectors.
fasttext.util.download_model('en', if_exists='ignore')
ft_en_model = fasttext.load_model('cc.en.300.bin')

def normalize_vector(vec):
    # Divide a vector by its L2 norm; zero vectors are returned unchanged.
    norm = np.sqrt(np.sum(vec**2))
    if norm != 0:
        return vec / norm
    else:
        return vec

vec1 = normalize_vector(ft_en_model.get_word_vector('Paris'))
vec2 = normalize_vector(ft_en_model.get_word_vector('is'))
vec3 = normalize_vector(ft_en_model.get_word_vector('the'))
vec4 = normalize_vector(ft_en_model.get_word_vector('capital'))
vec5 = normalize_vector(ft_en_model.get_word_vector('of'))
vec6 = normalize_vector(ft_en_model.get_word_vector('France'))

# Manual average of the normalized word vectors (no EOS term).
sent_vec = (vec1 + vec2 + vec3 + vec4 + vec5 + vec6) / 6.0
print(sent_vec[0:10])

# fastText's own sentence vector for comparison.
vec_s1 = ft_en_model.get_sentence_vector('Paris is the capital of France')
print(vec_s1[0:10])

The output in both cases is:

[-0.00648477 -0.01590857 -0.02449585 -0.00863768 -0.00655541  0.00647134
  0.01945119 -0.00058179 -0.03748131  0.01811352]
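
If you want to reuse this check, one possible sketch (assuming simple whitespace tokenization and reusing np, normalize_vector, and ft_en_model from above; the helper name manual_sentence_vector is mine) wraps the normalization and averaging in a function and compares it to get_sentence_vector with a tolerance:

def manual_sentence_vector(model, sentence):
    # Normalize each word vector by its L2 norm (zero vectors are skipped)
    # and average the result, mirroring the behaviour demonstrated above.
    vectors = [normalize_vector(model.get_word_vector(w)) for w in sentence.split()]
    vectors = [v for v in vectors if np.any(v)]
    return np.mean(vectors, axis=0)

sentence = 'Paris is the capital of France'
manual = manual_sentence_vector(ft_en_model, sentence)
builtin = ft_en_model.get_sentence_vector(sentence)
print(np.allclose(manual, builtin, atol=1e-6))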
Kumar Saurabh

You might be hitting a floating-point issue - e.g. if one addition was done on a CPU and one on a GPU, the results could differ.

The best way to check whether it is doing what you want is to verify that the vectors are almost exactly the same.

You might want to print out the two vectors and inspect them manually, or take the dot product of (one_two - one_two_avg) with itself (i.e. the squared length of the difference between the two), as sketched below.
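
A minimal sketch of that check, assuming one_two and one_two_avg from the question's script are available:

# Squared length of the difference between the two vectors;
# a value near zero means they agree up to floating-point noise.
diff = one_two - one_two_avg
print(np.dot(diff, diff))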

garysieling