
I am trying to get my head around dataframes in Spark, and I've hit a point that has me stumped.

I have a dataframe that I've populated from a json file. From this file, I've pulled out a text field to do a clustering exercise. Now that I have the cluster data in hand, I'd like to merge it back to the original dataframe in order to do some further aggregation-type analysis.

Here's what I have so far:

from pyspark.sql import SQLContext
from pyspark.sql.functions import concat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# sc is an existing SparkContext; tokenize_and_stem is my own nltk-based tokenizer
sql = SQLContext(sc)

# load the movie data and combine the two plot fields into a single "synopsis" column
df = sql.read.json("../data/movie_data.json")
df2 = df.withColumn("synopsis", concat(df.imdb, df.wikipedia))

# pull the synopses down to the driver as a plain python list
synopses = []
rows = df2.select(df2.synopsis).collect()
for row in rows:
    synopses.append(row.synopsis)

# vectorize the synopses with TF-IDF ...
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)

# ... and cluster them with scikit-learn's KMeans
km = KMeans(n_clusters=5)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

The clusters collection is a python list of integers, in the same order as the rows in the source dataframe. It would be nice to add "cluster" as a new column on that dataframe, but I don't think that's possible: withColumn seemed like a possibility, but it can only derive the new column from columns that already exist in the dataframe, not from an external python list (see this SO question).
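
If it helps, here is a minimal sketch of one way I imagine this could be stitched together (untested, and the idx column plus the indexed_df, cluster_df and df_with_clusters names are just things I'm making up): give every row a positional index with zipWithIndex, turn the local clusters list into a small dataframe keyed by the same index, and join the two.

from pyspark.sql import Row

# attach a positional index to every row of the original dataframe
indexed_df = sql.createDataFrame(
    df2.rdd.zipWithIndex().map(lambda pair: Row(idx=pair[1], **pair[0].asDict())))

# build a tiny dataframe of (idx, cluster) pairs from the local python list
cluster_df = sql.createDataFrame(
    [Row(idx=i, cluster=c) for i, c in enumerate(clusters)])

# join on the index so every original row picks up its cluster label
df_with_clusters = indexed_df.join(cluster_df, on="idx").drop("idx")

This leans on zipWithIndex assigning indices in the same order that collect() returned the rows, which is the same ordering assumption I'm already making for the clusters list.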

I modified this sample from a jupyter notebook that uses pandas dataframes, which are mutable. Is it even possible to do what I'm trying to do with Spark dataframes?

  • I am confused. Do you want to use SciKit model with PySpark? – zero323 Jan 12 '16 at 16:10
  • Nope, that's not the problem. I didn't include the imports in the code snippet. I'm using sklearn via an anaconda install and the tokenize_and_stem function is a custom function leveraging nltk. – Stuart Jan 12 '16 at 17:07
  • It is an extremely artificial scenario. If your data fits in the memory of a local machine, using a Spark DataFrame doesn't make sense. – zero323 Jan 12 '16 at 17:35
  • Yes, it is, but isn't it common to develop an idea locally with a small dataset? A second question, if I get past this one, is whether the idea would translate to a large corpus on a Hadoop cluster? That is what we are ultimately learning to develop for. – Stuart Jan 12 '16 at 21:21
  • It is, but using a local data structure of size O(N) simply won't scale. Typically you perform operations like this on the distributed data itself (predict inside a `map` instead of outside, sketched below), or `join` / `zip` distributed data structures, and there is no need for something like this. – zero323 Jan 12 '16 at 21:58
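
A rough, hypothetical sketch of that "predict inside a map" idea (it assumes the fitted vectorizer, the KMeans model and the custom tokenize_and_stem function can all be pickled and shipped to the executors; bc_vec, bc_km, label_partition and labeled_rdd are made-up names):

# broadcast the locally fitted sklearn objects so the executors can use them
bc_vec = sc.broadcast(tfidf_vectorizer)
bc_km = sc.broadcast(km)

def label_partition(rows):
    # transform and predict on the executor, never collecting the synopses to the driver
    vec, model = bc_vec.value, bc_km.value
    for row in rows:
        features = vec.transform([row.synopsis])
        yield (row.synopsis, int(model.predict(features)[0]))

labeled_rdd = df2.rdd.mapPartitions(label_partition)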

0 Answers