I am trying to get my head around dataframes in Spark, and I've hit a point that has me stumped.
I have a dataframe that I've populated from a JSON file. From this file, I've pulled out a text field to do a clustering exercise. Now that I have the cluster labels in hand, I'd like to merge them back into the original dataframe in order to do some further aggregation-type analysis.
Here's what I have so far:
```python
from pyspark.sql import SQLContext
from pyspark.sql.functions import concat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sql = SQLContext(sc)
df = sql.read.json("../data/movie_data.json")

# Combine the two synopsis fields into a single text column
df2 = df.withColumn("synopsis", concat(df.imdb, df.wikipedia))

# Pull the synopses down to the driver as a plain Python list
synopses = []
rows = df2.select(df2.synopsis).collect()
for row in rows:
    synopses.append(row.synopsis)

# tokenize_and_stem comes from the notebook I adapted this from
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)

km = KMeans(n_clusters=5)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
```
The `clusters` collection is a Python list of integers, in the same order as the rows in the source dataframe. It would be nice to add `cluster` as a new column to the dataframe, but I don't think that's possible: `withColumn` can only derive the new column from columns that already exist in the same dataframe, not from an external list (see this SO question).
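The closest idea I've come up with is to fake a positional index on both sides and join on it. This is only an untested sketch, and I'm not even sure Spark guarantees that `zipWithIndex` sees the rows in the same order my `collect` did:

```python
from pyspark.sql import Row

# Attach a positional index to each row of the original dataframe
indexed = df2.rdd.zipWithIndex().map(
    lambda pair: Row(idx=pair[1], **pair[0].asDict()))
df_indexed = sql.createDataFrame(indexed)

# Build a small (idx, cluster) dataframe from the local list of labels
df_clusters = sql.createDataFrame(
    [Row(idx=i, cluster=c) for i, c in enumerate(clusters)])

# Join on the index and drop the helper column
df_final = df_indexed.join(df_clusters, on="idx").drop("idx")
```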
I adapted this code from a sample in a Jupyter notebook that uses pandas DataFrames, which are mutable. Is it even possible to do what I'm trying to do with Spark dataframes?
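In case it matters, the other fallback I've considered is to join on the synopsis text itself, since I already have both lists on the driver. Again, just an untested sketch, and it assumes every synopsis is unique:

```python
# Pair each synopsis with its cluster label and ship the pairs back to Spark
df_clusters = sql.createDataFrame(
    list(zip(synopses, clusters)), ["synopsis", "cluster"])

# Join back to the combined dataframe on the synopsis text
df_final = df2.join(df_clusters, on="synopsis")
```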