
I have recently been experimenting with Word2Vec and I noticed whilst trawling through forums that a lot of other people are also creating their own vectors from their own databases.

This has made me curious about how vectors look across databases, and whether vectors take a universal orientation.

I understand that the vectors are created as a result of the context in which words are found in the corpus. So in that sense perhaps you wouldn't expect words to have the same orientation across databases. However, if the language of the documents is constant, then the contexts should be at least somewhat similar across different databases (excluding ambiguous words like bank (for money) and (river) bank). And if they are somewhat similar, it seems plausible that as we look at more commonly occurring words their directions may converge?

SamPassmore
  • What do you mean by "orientation"? The usual visualization is that each new feature (i.e. each new word, in this context) creates a new dimension in the vector space. – tripleee Jun 12 '15 at 04:14
  • Within Word2Vec, similarity is measured by cosine distance, which is based on the cosine of the angle between two vectors - meaning it is a judgement of orientation. For example: two vectors with the same orientation will have a cosine similarity of 1. So my question revolves around this definition of orientation. Apologies if I have misinterpreted how Word2Vec defines orientation. – SamPassmore Jun 12 '15 at 04:33
  • Yes, that's how it works, but "orientation" does not play a role here. In order for two databases of vectors to be compatible, they have to assign the same dimension to the same feature, if that's what you are asking. The mathematical concept of a vector space generalizes to millions of dimensions, and you can calculate a cosine similarity in this space, but there is no "orientation" in this model. – tripleee Jun 12 '15 at 04:51
  • Based on quick googling "how word2vec defines orientation" is not well-defined at all. If you have documentation where this is actually defined, please include a link. – tripleee Jun 12 '15 at 04:52
  • I don't have a Word2Vec orientation definition - I had assumed that if they used cosine distance there must be some level of orientation involved. But I will research your comment to try and understand this better. A follow-up question: when you refer to dimension, is this related to 'size' argument in word2vec? – SamPassmore Jun 12 '15 at 05:00
  • Perhaps see also http://stackoverflow.com/a/27504795/874188 – tripleee Jun 12 '15 at 05:01
  • I guess by "size" they mean number of dimensions, yes. – tripleee Jun 12 '15 at 05:02
  • So to come full circle: if we have two different databases with the same size / dimension, will the cosine similarity between two word vectors be the same / similar? i.e. will their similarity be universal? – SamPassmore Jun 12 '15 at 05:07
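The cosine similarity discussed in these comments is easy to state concretely. A minimal numpy sketch (the vector values are made up purely for illustration; this is not word2vec's internal code):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])

# Parallel vectors (same "orientation") give similarity 1,
# regardless of their magnitudes:
print(cosine_similarity(a, 2 * a))  # → 1.0

# Orthogonal vectors give similarity 0:
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # → 0.0
```

Note that the value only compares two vectors *within* one space; it says nothing about how a vector in one model relates to a vector in another model.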

1 Answer


As outlined in the comments, "orientation" is not a well-defined concept in this context. A traditional word vector space has one dimension for each term.

In order for word vectors to be compatible, they will need to have the same term order. This is typically not the case between different vector collections, unless you build them from exactly the same documents in exactly the same order with exactly the same algorithms.

You could construe "orientation" as "vectors with the same terms in the same order" but the parallel to three-dimensional geometry is already strained as it is. It's probably better to avoid this term.

Given two collections of vectors from reasonably representative input in a known language, the most frequent terms will probably have similar distributions, so you could perhaps derive a mapping from one representation to another with some accuracy (see Zipf's Law). Back in the long tail of rare terms, you will certainly not be able to identify any useful mappings.

tripleee