7

I am sorry for my naivety, but I don't understand why the word embeddings that result from an NN training process (word2vec) are actually vectors.

Embedding is a process of dimensionality reduction: during training, the NN reduces the 1/0 arrays of words into smaller arrays, and the process does nothing that applies vector arithmetic.

So the result is just arrays, not vectors. Why should I think of these arrays as vectors?

Even if we do get vectors, why does everyone depict them as arrows coming from the origin (0, 0)?

Again, I am sorry if my question looks stupid.

  • I have no profound mathematical background, but aren't you mixing programming terms (array as a data structure) with mathematical ones (vectors as a mathematical concept)? – lenz Oct 13 '17 at 08:54
  • A good reason to call word2vec's output "vectors" is that you can estimate the similarity of two words by measuring the cosine distance of their corresponding vectors. – lenz Oct 13 '17 at 08:56
  • @lenz, thank you for your comment. I just tried to say what word embeddings are. I think your reasoning "they are vectors because we calculate cosine distance" is incorrect; actually, we use cosine distance because they are vectors. But why they are vectors, I still don't know. – com Oct 13 '17 at 09:59
  • If you haven't already, have a look at this video, [Vectors, what even are they?](https://youtu.be/fNk_zzaMoSs) – Cedias Oct 16 '17 at 13:34

4 Answers

8

What are embeddings?

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

(Source: https://en.wikipedia.org/wiki/Word_embedding)
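
To make the "one dimension per word" vs. "much lower dimension" idea concrete, here's a tiny sketch (the vocabulary, the 4-dimensional embedding size and the values are made up purely for illustration):

import numpy as np

vocab = ['car', 'vehicle', 'apple', 'orange', 'fruit']

# "one dimension per word": each word is a sparse 1/0 (one-hot) array of length len(vocab)
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot['apple'])          # [0. 0. 1. 0. 0.]

# "continuous vector space with much lower dimension": each word is a short, dense array of reals
embedding_dim = 4                # toy size; real models use a few hundred dimensions
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=embedding_dim) for w in vocab}
print(embeddings['apple'])       # four arbitrary real numbers (random here, learned in practice)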

What is Word2Vec?

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

(Source: https://en.wikipedia.org/wiki/Word2vec)
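
For example, here's a minimal sketch of training such a model with the gensim library (assuming gensim 4.x is installed; the toy corpus and parameters are made up and far too small to give meaningful vectors):

from gensim.models import Word2Vec

# toy "corpus": a list of tokenized sentences
sentences = [
    ['apple', 'is', 'a', 'fruit'],
    ['orange', 'is', 'a', 'fruit'],
    ['samsung', 'makes', 'phones'],
    ['apple', 'makes', 'phones'],
]

# train word2vec; each unique word in the corpus gets assigned a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100, seed=0)

vec = model.wv['apple']                         # a numpy array of 50 real numbers, i.e. the word's vector
print(vec.shape)                                # (50,)
print(model.wv.most_similar('apple', topn=3))   # nearest words in the vector space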

What's an array?

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.

An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.

The simplest type of data structure is a linear array, also called one-dimensional array.

What's a vector / vector space?

A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.

Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.

The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms.

(Source: https://en.wikipedia.org/wiki/Vector_space)

What's the difference between vectors and arrays?

Firstly, the vector in word embeddings is not exactly the programming-language data structure (so it's not "Arrays vs Vectors: Introductory Similarities and Differences").

Programmatically, a word embedding vector IS some sort of an array (data structure) of real numbers (i.e. scalars).

Mathematically, any element with one or more dimensions populated with real numbers is a tensor, and a vector is the special case of a single dimension of scalars.
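
To see the difference concretely: a plain Python list is just the array data structure, while the same numbers treated as a vector support the two vector-space operations from the definition above, element-wise addition and scalar multiplication (a minimal check, using the toy values from the next section):

>>> import numpy as np
>>> a = [232, 1010]                  # an array (data structure): '+' concatenates
>>> b = [300, 250]
>>> a + b
[232, 1010, 300, 250]
>>> np.array(a) + np.array(b)        # the same numbers treated as vectors: element-wise addition
array([ 532, 1260])
>>> 2 * np.array(a)                  # ... and scalar multiplication
array([ 464, 2020])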


To answer the OP's question:

Why are word embeddings actually vectors?

By definition, word embeddings are vectors (see above).

Why do we represent words as vectors of real numbers?

To learn the differences between words, we have to quantify the difference in some manner.

Imagine that we assign these "smart" numbers to the words:

>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848

We see that the distance between fruit and apple is small, but between samsung and apple it isn't. In this case, a single numerical "feature" per word captures some information about the word meanings, but not fully.

Imagine that we have two real-number values for each word (i.e. a vector):

>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}

To compute the difference, we could do:

>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 760])

>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848,    8])

That's not very informative: it returns a vector, and we don't get a definitive measure of the distance between the words. So we can try some vector tricks and compute the distance between the vectors, e.g. the Euclidean distance:

>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])

>>> np.linalg.norm(apple-orange)
763.03604108849277

>>> np.linalg.norm(apple-samsung)
848.03773500947466

>>> np.linalg.norm(orange-samsung)
1083.4685043876448

Now we can see more "information": apple is closer to samsung than orange is to samsung. Possibly that's because apple co-occurs more frequently with samsung in the corpus than orange does.

The big question is: "How do we get these real numbers that represent the words' vectors?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio et al. 2003) come in.


Taking a detour

Since adding more real numbers to the vector representing a word makes it more informative, why don't we just add a lot more dimensions (i.e. more columns in each word vector)?

Traditionally, in the field of distributional semantics / distributed lexical semantics, we computed the differences between words using word-by-word co-occurrence matrices, but these matrices become really sparse, with many zero values, when words don't co-occur with one another.

Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like taking a top-down view of the global relations between words and then compressing the matrix to get a smaller vector for each word, as in the sketch below.
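
Here's a minimal sketch of that count-then-compress idea (the toy sentence, window size and number of kept dimensions are made up for illustration; real systems typically also reweight the counts, e.g. with PPMI, before reducing):

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
window = 2

# count how often each pair of words co-occurs within the window
counts = np.zeros((len(vocab), len(vocab)))
for pos, word in enumerate(corpus):
    for off in range(-window, window + 1):
        ctx = pos + off
        if off != 0 and 0 <= ctx < len(corpus):
            counts[idx[word], idx[corpus[ctx]]] += 1

# the matrix is mostly zeros; compress it with a truncated SVD
U, S, Vt = np.linalg.svd(counts)
k = 3                                    # keep only the top-k dimensions
word_vectors = U[:, :k] * S[:k]          # each row is now a dense k-dimensional word vector
print(word_vectors[idx['fox']])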

So the "deep learning" word embedding creation comes from the another school of thought and starts with a randomly (sometimes not-so random) initialized a layer of vectors for each word and learning the parameters/weights for these vectors and optimizing these parameters/weights by minimizing some loss function based on some defined properties.

It sounds a little vague, but if we look concretely at the Word2Vec learning technique it becomes clearer; see the simplified sketch below.
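
As a rough illustration only, here's a drastically simplified sketch of the skip-gram-with-negative-sampling idea (uniform negative sampling, no subsampling, a toy corpus and made-up hyperparameters, so it shows the idea rather than the real word2vec implementation):

import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

dim, lr, window, epochs, n_neg = 10, 0.05, 2, 200, 3
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # target-word vectors (the embeddings we keep)
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for pos, word in enumerate(corpus):
        t = idx[word]
        for off in range(-window, window + 1):
            c_pos = pos + off
            if off == 0 or not (0 <= c_pos < len(corpus)):
                continue
            # one observed (positive) context word plus a few random (negative) ones
            pairs = [(idx[corpus[c_pos]], 1.0)]
            pairs += [(int(rng.integers(len(vocab))), 0.0) for _ in range(n_neg)]
            for j, label in pairs:
                v_t, v_c = W_in[t].copy(), W_out[j].copy()
                grad = sigmoid(v_t @ v_c) - label   # gradient of the logistic loss w.r.t. the dot product
                W_in[t] -= lr * grad * v_c          # nudge vectors so co-occurring words get a high dot product
                W_out[j] -= lr * grad * v_t

# after training, each row of W_in is the learned vector ("embedding") for one word
print(W_in[idx['fox']])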

Here are more resources to read up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors

alvas
5

the process does nothing that applies vector arithmetic

The training process has nothing to do with vector arithmetic, but when the arrays are produced, it turns out they have pretty nice properties, so that one can think of a "word linear space".

For example, what words have embeddings closest to a given word in this space?

[image: the words whose embeddings are closest to a given word]

Put differently, words with similar meanings form a cloud. Here's a 2-D t-SNE representation:

[image: 2-D t-SNE projection of the word embeddings, with similar words clustered together]

Another example: the vector difference between "man" and "woman" is very close to the difference between "uncle" and "aunt":

[image: vector offsets between word pairs such as man/woman and uncle/aunt]

As a result, you have pretty much reasonable arithmetic:

W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")

So it's not far-fetched to call them vectors. All pictures are from this wonderful post, which I very much recommend reading.
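
If you want to try this arithmetic yourself, here's a hedged sketch using gensim (it assumes gensim and its downloader are installed and that the pretrained `glove-wiki-gigaword-50` vectors can be fetched, which downloads them on first use):

import gensim.downloader as api

# load a small set of pretrained word vectors
kv = api.load('glove-wiki-gigaword-50')

# king - man + woman ~= queen: most_similar does exactly this vector arithmetic,
# then returns the nearest words by cosine similarity
print(kv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

# the uncle/aunt offset should likewise be close to the man/woman offset
print(kv.most_similar(positive=['uncle', 'woman'], negative=['man'], topn=3))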

Maxim
  • Thank you very much for your answer. Do you know why the vectors come from the origin? – com Oct 13 '17 at 11:09
  • @com the origin doesn't matter a lot in word embeddings. You can safely shift the origin (and thus all vectors) and all these properties above will still hold. What matters is the relative positioning of the vectors. – Maxim Oct 13 '17 at 11:13
  • thank you for elaborating on the `word linear space` property. I am exploring ways of measuring the "quality" of word embedding techniques. Besides "similar words have a closer distance in the embedding space", is there any other property I can use to measure the quality of embedding techniques? Thank you very much. – lllllllllllll Apr 08 '20 at 06:01
1

Each word is mapped to a point in a d-dimensional space (d is usually 300 or 600, though not necessarily), and thus it's called a vector (each point in a d-dimensional space is nothing but a vector in that space).

The points have some nice properties: words with similar meanings tend to lie closer to each other, where proximity is measured using the cosine distance between two word vectors, as in the sketch below.
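
For example, here's a minimal sketch of that cosine measure with numpy (the toy 4-dimensional vectors are made up for illustration):

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two vectors: close to 1 means "pointing the same way"
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy 4-dimensional "word vectors"
cat = np.array([0.2, 0.9, 0.1, 0.4])
dog = np.array([0.25, 0.8, 0.15, 0.5])
car = np.array([0.9, 0.05, 0.7, 0.1])

print(cosine_similarity(cat, dog))   # high: similar directions
print(cosine_similarity(cat, car))   # lower: dissimilar directions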

Anuj Gupta
1

The well-known Word2Vec implementations are CBOW and Skip-Gram.

Your input for CBOW is a set of one-hot word vectors (each is a vector of length N, where N is the size of the vocabulary). Together, these input word vectors form an M × N array, where M is the number of words.

Now what is interesting in the graphic below is the projection step, where we force the NN to learn a lower-dimensional representation of our input space in order to predict the output correctly. For CBOW, the desired output is the current (centre) word given its surrounding context words.

This lower-dimensional representation P consists of abstract features describing the words, e.g. location, adjective, etc. (in reality the learned features are not really that clear-cut). These features represent one view of the words.

And, as with all features, we can see them as high-dimensional vectors. If you want, you can use dimensionality-reduction techniques to display them in 2- or 3-dimensional space, as in the sketch below.
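
For example, here's a minimal sketch using scikit-learn's PCA (assuming scikit-learn is installed; the random 50-dimensional vectors just stand in for embeddings from a trained model, and t-SNE could be used the same way):

import numpy as np
from sklearn.decomposition import PCA

words = ['king', 'queen', 'man', 'woman', 'apple', 'orange']
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 50))   # stand-in for real 50-dimensional word embeddings

# project the 50-dimensional vectors down to 2 dimensions for plotting
coords = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f'{word}: ({x:.2f}, {y:.2f})')      # these 2-D points can be drawn on a scatter plot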

[figure: CBOW architecture with input, projection and output layers]

More details and source of graphic: https://arxiv.org/pdf/1301.3781.pdf

Rick