How to get a dataframe from word2vec model using spark

Question

I am currently working on a sparkling water application and I am a total beginner in spark and h2o.

What I want to do:

loading a input textfile
create a word2vec model
create a dataframe with a column word and a column Vector
using the dataframe as input for h2o

By creating the model i get a map, but i don't know how to create a dataframe of it. The output should look like that:

word | Vector

assert | [0.3, 0.4.....]

sense | [0.6, 0.2.....] and so on.

This is my code so far:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o

from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row


# Starting h2o application on spark cluster
hc = H2OContext(sc).start()

# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))

# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)

# Sanity check
model.findSynonyms("property",5)

# assign vector representation (map to variable
wordVectorsDF = model.getVectors()

# Transform wordVectorsDF word into dataframe

Is there any approach to that or functions provided by spark?

Thanks in advance

score 3 · Accepted Answer · answered Jul 13 '16 at 17:36

3

I found out that there are two libraries for a Word2Vec transformation - I don't know why.

from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec

The second line returns a data frame with the function getVectors()and has diffenrent parameters for building a model from the first line.

Maybe somebody can comment on that concerning the 2 different libraries.

Thanks in advance.

answered Jul 13 '16 at 17:36

sedioben

935
1
10
16

1

There are 2 ML packages in Spark currently as one is the old one and the second one (aimed at DataFrame) is the new on - the old one is there for backwards compatibility – Mateusz Dymczyk Jul 13 '16 at 17:46
This answer puts more light on the differences: https://stackoverflow.com/q/38835829/2650427 – TrigonaMinima Jun 10 '23 at 21:00

score -1 · Answer 2 · answered Jun 30 '16 at 19:15

First of all in H2O we don't support a Vector column type, you'd have to make a frame like this:

word   | V1  | V2  | ...
assert | 0.3 | 0.4 | ...
sense  | 0.6 | 0.2 | ...

Now for the actual question - no, since it's a Scala Map, we provide ways to create frames from data sources (files on HDFS/S3, databases etc) or conversions from RDDs/DataFrames but not from Java/Scala collections. Writing one would be possible but quite cumbersome.

Not the most performant solution but the easiest code-wise would be to make a DF (or RDD) first (by running sc.parallelize on map.toSeq) and then convert it to an H2OFrame:

import hc._
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
val h2oFrame = asH2OFrame(wordsDF)

Thanks for the help, these steps in the code are running on spark and spark supports vector as column type. In future steps H2O it self transforms that like the way you already mentioned above. — sedioben, Jul 13 '16 at 17:40

How to get a dataframe from word2vec model using spark

2 Answers2