Here is an example in pyspark
, which I guess is straightforward to port to Scala - the key is the use of model.transform
.
First, we train the model as in the example:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
sc = SparkContext()
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
k = 220 # vector dimensionality
word2vec = Word2Vec().setVectorSize(k)
model = word2vec.fit(inp)
k
is the dimensionality of the word vectors - the higher the better (default value is 100), but you will need memory, and the highest number I could go with my machine was 220. (EDIT: Typical values in the relevant publications are between 300 and 1000)
After we have trained the model, we can define a simple function as follows:
def getAnalogy(s, model):
qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])
res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"
res = [x[0] for x in res]
for k in range(0,3):
if s[k] in res:
res.remove(s[k])
return res[0]
Now, here are some examples with countries and their capitals:
s = ('france', 'paris', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('china', 'beijing', 'russia')
getAnalogy(s, model)
# u'moscow'
s = ('spain', 'madrid', 'greece')
getAnalogy(s, model)
# u'athens'
s = ('germany', 'berlin', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('japan', 'tokyo', 'sweden')
getAnalogy(s, model)
# u'stockholm'
s = ('finland', 'helsinki', 'iran')
getAnalogy(s, model)
# u'tehran'
s = ('egypt', 'cairo', 'finland')
getAnalogy(s, model)
# u'helsinki'
The results are not always correct - I'll leave it to you to experiment, but they get better with more training data and increased vector dimensionality k
.
The for
loop in the function removes entries that belong to the input query itself, as I noticed that frequently the correct answer was the second one in the returned list, with the first usually being one of the input terms.