17

I was looking at the Word2Vec example on the Spark site:

import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("country name here", 40)

How do I do the interesting vector arithmetic, such as king - man + woman = queen? I can get the vectors with model.getVectors, but I am not sure how to proceed from there.

user3803714

3 Answers

22

Here is an example in PySpark, which should be straightforward to port to Scala. The key is model.transform, which returns the vector for a given word.

First, we train the model as in the example:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext()
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))

k = 220         # vector dimensionality
word2vec = Word2Vec().setVectorSize(k)
model = word2vec.fit(inp)

k is the dimensionality of the word vectors (the default value is 100). Higher values generally give better results but need more memory; the highest value my machine could handle was 220. (EDIT: Typical values in the relevant publications are between 300 and 1000.)

After we have trained the model, we can define a simple function as follows:

def getAnalogy(s, model):
    qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])
    res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"
    res = [x[0] for x in res]
    for k in range(0,3):
        if s[k] in res:
            res.remove(s[k])
    return res[0]

Now, here are some examples with countries and their capitals:

s = ('france', 'paris', 'portugal')
getAnalogy(s, model)
# u'lisbon'

s = ('china', 'beijing', 'russia')
getAnalogy(s, model)
# u'moscow'

s = ('spain', 'madrid', 'greece')
getAnalogy(s, model)
# u'athens'

s = ('germany', 'berlin', 'portugal')
getAnalogy(s, model)
# u'lisbon'

s = ('japan', 'tokyo', 'sweden')
getAnalogy(s, model)
# u'stockholm'

s = ('finland', 'helsinki', 'iran')
getAnalogy(s, model)
# u'tehran'

s = ('egypt', 'cairo', 'finland')
getAnalogy(s, model)
# u'helsinki'

The results are not always correct - I'll leave it to you to experiment, but they get better with more training data and increased vector dimensionality k.

The for loop in the function removes entries that belong to the input query itself, as I noticed that frequently the correct answer was the second one in the returned list, with the first usually being one of the input terms.
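To make the arithmetic and the filtering step concrete without a Spark cluster, here is a minimal pure-Python sketch. The vectors in TOY_VECTORS are invented for illustration, and get_analogy_toy, cosine and rank_by_similarity are hypothetical helper names, not part of the Spark API. The sketch follows the same convention as getAnalogy: for s = (a, b, c) it builds the query b - a + c, ranks every word by cosine similarity, and drops the input terms.

```python
import math

# Toy 3-D vectors invented for illustration; a trained model would supply real ones.
TOY_VECTORS = {
    "france":   [1.0, 0.0, 0.2],
    "paris":    [1.0, 0.0, 1.0],
    "portugal": [0.0, 1.0, 0.2],
    "lisbon":   [0.0, 1.0, 1.0],
    "spain":    [0.5, 0.5, 0.2],
    "madrid":   [0.5, 0.5, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Return all words, sorted by decreasing cosine similarity to the query."""
    return sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)


def get_analogy_toy(s, vectors):
    """For s = (a, b, c), return the word closest to b - a + c, excluding a, b and c."""
    a, b, c = (vectors[w] for w in s)
    query = [bi - ai + ci for ai, bi, ci in zip(a, b, c)]
    ranked = rank_by_similarity(query, vectors)
    # Drop the input terms, which tend to rank near the top of the list.
    return [w for w in ranked if w not in s][0]


print(get_analogy_toy(("france", "paris", "portugal"), TOY_VECTORS))
```

Printing rank_by_similarity for the same query shows the input terms sitting among the nearest neighbours, which is why the filter is there.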

desertnaut
  • 2
    Can you please explain why you are multiplying by -1 here: res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"? Also, can you add some comments about the for loop in your getAnalogy function? – user3803714 Dec 17 '15 at 03:35
  • 2
    The example with the same dataset did not work as expected. Running res = getAnalogy(s, model) followed by print "Result is: " + res gives the output: Result is: montpellier – user3803714 Dec 17 '15 at 03:44
  • 3
    1) -1 is just for keeping the qry order intuitive; you can change this order and remove it. 2) I have already provided comments regarding the for loop; try removing it and returning all of res (instead of just res[0]) to see why it is necessary. 3) I already said that the results are not always correct, but they get better with increasing k (papers use at least k=300); moreover, exact results depend on the random seed. All in all, the answer is exactly about the word2vec mathematics, which is what the question was about. – desertnaut Dec 18 '15 at 13:20
  • 2
    @user3803714 Keep also in mind that results shown in publications and demos are always hand-picked, i.e. the mistaken results are simply not shown (although they indeed exist). – desertnaut Dec 18 '15 at 13:30
1
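  // Assumed imports: netlib-java BLAS and the MLlib DenseVector
  import com.github.fommil.netlib.BLAS.{getInstance => blas}
  import org.apache.spark.mllib.linalg.DenseVector

  // getVectors returns a Map[String, Array[Float]] of word -> vector
  val w2v_map = sameModel.getVectors

  // NOTE: the third key is "women", as in the original; use "woman" for the analogy from the question
  val (king, man, woman) = (w2v_map("king"), w2v_map("man"), w2v_map("women"))

  val n = king.length

  // saxpy(n: Int, sa: Float, sx: Array[Float], incx: Int, sy: Array[Float], incy: Int)
  // computes sy := sa * sx + sy in place
  blas.saxpy(n, -1, man, 1, king, 1)   // king := king - man
  blas.saxpy(n, 1, woman, 1, king, 1)  // king := king - man + woman

  val vec = new DenseVector(king.map(_.toDouble))

  // findSynonyms accepts either a word or a vector
  val most_similar_word_to_vector = sameModel.findSynonyms(vec, 10)
  for ((synonym, cosineSimilarity) <- most_similar_word_to_vector) {
    println(s"$synonym $cosineSimilarity")
  }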

Running it gives the following result:

women 0.628454885964967
philip 0.5539534290356802
henry 0.5520055707837214
vii 0.5455116413024774
elizabeth 0.5290994886254643
**queen 0.5162519562606844**
men 0.5133851770249461
wenceslaus 0.5127030522678778
viii 0.5104392579985102
eldest 0.510425791249559
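For readers unfamiliar with BLAS, the two saxpy calls above each perform an in-place update of the form y := alpha * x + y. The following is a small pure-Python sketch of that contract for unit strides; axpy is a hypothetical helper written for illustration, and the three vectors are made-up numbers rather than model output.

```python
def axpy(alpha, x, y):
    """In-place update y := alpha * x + y (the BLAS axpy contract with unit strides)."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


# Made-up 3-D vectors, for illustration only.
king = [0.5, 0.25, 1.0]
man = [0.25, 0.25, 0.0]
woman = [0.0, 0.5, 0.0]

axpy(-1.0, man, king)   # king now holds king - man
axpy(1.0, woman, king)  # king now holds king - man + woman

print(king)
```

Note that the result overwrites the list passed as y, just as the Scala code overwrites the king array, while x is left unchanged.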
jay liu
-3

Here is the pseudocode. For the full API, read the documentation: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html

  1. w2v_map = model.getVectors() # this gives you a map {word: vec}
  2. my_vector = w2v_map.get('king') - w2v_map.get('man') + w2v_map.get('woman') # do the vector algebra here
  3. most_similar_word_to_vector = model.findSynonyms(my_vector, 10) # there is one API to get synonyms for a word, and one for a vector

edit: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html#findSynonyms(org.apache.spark.mllib.linalg.Vector,%20int)

jxieeducation
  • 2
    It is not clear how to do the vector math. Breeze or Spark vectors? That is a key component of the question. – user3803714 Dec 15 '15 at 19:06
  • 2
    You do the vector match with the method I listed, public scala.Tuple2 findSynonyms(Vector vector, int num): https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html#findSynonyms(org.apache.spark.mllib.linalg.Vector,%20int) – jxieeducation Dec 15 '15 at 19:09