
I'd like to cluster tweets based on topic (e.g. all Amazon tweets in one cluster, all Netflix tweets in another, etc.). The thing is, all the incoming tweets are already filtered on these keywords, but they arrive jumbled up, and I need to categorize them as they come in.

I'm using Spark Streaming and am looking for a way to vectorize these tweets. Because the data arrives in micro-batches, I don't have access to the entire corpus of tweets up front.

ethereumbrella
  • Tweets are text-based, while k-means works on continuous data. How will you account for this? If the tweets are jumbled up, how is k-means going to help there? Consider refining the question to narrow down its focus. – mnm Jul 31 '18 at 13:21
  • Basically, what I mean is that I get a constant stream of tweets, and Spark has a special StreamingKMeans that continuously updates the model's centers as new data comes in to better fit it. Say I have 3 kinds of tweets: A, B, C. They can come in like ACBABABCBCCCCBAA etc., and the model has to categorize them into their proper groups. But if suddenly D starts showing up, the cluster centers will shift to better fit the D tweets. Yes, the tweets are text-based, but I need to convert them into vector representations. – ethereumbrella Jul 31 '18 at 14:01
  • Clustering is unsupervised. For k-means, this means the number of clusters *k* needs to be specified beforehand. Now if you already know the "type of tweets", i.e. you already know the groups "A, B, C", then why do clustering? Besides, the cluster centers will not change to fit the "D" tweets unless you specify it, i.e. you code the algorithm to detect 4 clusters instead of 3. – mnm Jul 31 '18 at 16:11
  • Right, but the batches of data coming in each second contain all these tweets jumbled up, so I just need a way to recategorize them. Also, the model I'm using has a "forgetfulness" ability that lets it forget past data and accommodate new data, so the centers will realign. Check out StreamingKMeans in Spark. – ethereumbrella Jul 31 '18 at 16:14
  • So why not focus on "sorting" the tweets rather than clustering them? Besides, have you seen this somewhat related [post](https://stackoverflow.com/questions/38544777/how-to-find-cluster-centers-of-sparks-streamingkmeans)? I think recategorizing them each time will bring considerable overhead and is also not an efficient approach. – mnm Jul 31 '18 at 16:21
  • How does this differ from your previous question: https://stackoverflow.com/q/51581688/1060350 – Has QUIT--Anony-Mousse Jul 31 '18 at 17:22
  • @nilāmbara The main reason for the project is to cluster "new activity", i.e. spikes/trends. Say I'm getting a stream of Amazon tweets. All of a sudden, people start complaining about a specific product. I don't think I'd be able to sort that, since it's unknown data; I'd need to cluster it, no? Also, I appreciate you taking the time to help, thanks! – ethereumbrella Jul 31 '18 at 17:28

1 Answer


If you have a predefined vocabulary in which multiple terms can be selected simultaneously (e.g. a set of non-mutually-exclusive tweet categories that you are interested in), then you can use a binary vector in which each bit represents one of the categories.
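For illustration, here is a minimal sketch of such a binary category vector in plain Python (the category keywords are a hypothetical assumption, not something from the question):

```python
# A minimal sketch of a binary category vector.
# The category keywords below are hypothetical, not from the question.
CATEGORIES = ["amazon", "netflix", "uber"]

def to_binary_vector(tweet_text):
    """One bit per category: 1 if that keyword appears in the tweet."""
    words = set(tweet_text.lower().split())
    return [1 if category in words else 0 for category in CATEGORIES]

print(to_binary_vector("Amazon delivery was late again"))  # [1, 0, 0]
```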

If the categories are mutually exclusive, then what could you hope to achieve by clustering? Specifically, there would be no "gray area" in which some observations belong to CategorySet-A, others to CategorySet-B, and others to some in-between combination. If every observation is hard-capped at one category, then you have discrete points, not clusters.

If instead you wish to cluster based on similar sets of words, then you might need to know the "vocabulary" up front, which in this case means: "what are the tweet terms that I care about?" In that case you can use a [bag-of-words model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) to compare the tweets, and then cluster based on the generated vectors.
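As a rough sketch, a bag-of-words count vector over a known vocabulary could look like this in plain Python (the vocabulary is an assumed example):

```python
# A minimal bag-of-words sketch over a known, fixed vocabulary.
# The vocabulary below is an illustrative assumption.
VOCABULARY = ["amazon", "netflix", "delivery", "refund", "stream"]

def bag_of_words(tweet_text):
    """Map a tweet to a vector of term counts over the fixed vocabulary."""
    words = tweet_text.lower().split()
    return [words.count(term) for term in VOCABULARY]

print(bag_of_words("amazon refund slow amazon delivery late"))
# -> [2, 0, 1, 1, 0]
```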

Now if you are uncertain of the vocabulary a priori (the likely case here, since you do not know the content of the next tweet), then you will likely resort to re-clustering on a regular basis as you gain new words, using an updated bag of words that includes the newly "seen" terms. Note that this incurs processing cost and latency. To avoid the cost/latency, you have to decide ahead of time which terms to restrict your clustering to, which may be possible if you're interested in a targeted subject.
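One way to sidestep the vocabulary problem entirely is feature hashing: `HashingTF` (which the asker mentions trying in the comments below) maps any term, seen or unseen, into a fixed-size index space, so the vector dimension never changes and no re-fitting is needed when new words appear. A hedged sketch of wiring it to `StreamingKMeans`, where the socket source, `k`, dimension, and decay factor are all illustrative assumptions:

```python
# A sketch combining HashingTF (fixed dimension, no vocabulary needed)
# with Spark's StreamingKMeans; the source, k, and sizes are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import StreamingKMeans

sc = SparkContext(appName="streaming-tweet-clusters")
ssc = StreamingContext(sc, batchDuration=1)

# Hashing maps any term, seen or unseen, into a fixed 1000-dim space,
# so the model never has to be re-fit when new words appear.
hashing_tf = HashingTF(numFeatures=1000)

# Hypothetical text source; substitute your actual tweet DStream.
tweets = ssc.socketTextStream("localhost", 9999)
vectors = tweets.map(lambda t: hashing_tf.transform(t.lower().split()))

# decayFactor < 1 gives the "forgetfulness" mentioned in the comments:
# older batches are down-weighted so centers can drift toward new topics.
model = StreamingKMeans(k=3, decayFactor=0.5).setRandomCenters(1000, 1.0, 42)
model.trainOn(vectors)
model.predictOn(vectors).pprint()

ssc.start()
ssc.awaitTermination()
```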

WestCoastProjects
  • That link was a godsend; learned a lot, thank you! The reason for this project is to detect new trending topics that arise over time. This is what happens, in order: 1) I get tweets that contain preselected keywords (amazon, netflix, uber, for instance) whose content I have no knowledge about. Suddenly, customers complain about an Amazon product and there's a spike. I'd like to cluster those tweets somehow. I'd like to go with the bag-of-words model you described, but with a Spark implementation. I'm trying out HashingTF, but am wondering how exactly to vectorize each tweet. – ethereumbrella Jul 31 '18 at 17:22
  • Try the [`CountVectorizer`](https://spark.apache.org/docs/2.2.0/ml-features.html#countvectorizer). – WestCoastProjects Jul 31 '18 at 19:02
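For reference, a minimal sketch of that `CountVectorizer` suggestion (the column names and sample tweets are illustrative assumptions):

```python
# A sketch of bag-of-words vectorization with Spark ML's CountVectorizer.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("tweet-countvectorizer").getOrCreate()

tweets = spark.createDataFrame(
    [("amazon lost my package again",),
     ("netflix just dropped a great series",)],
    ["text"],
)

words = Tokenizer(inputCol="text", outputCol="words").transform(tweets)

# CountVectorizer learns the vocabulary from the data it is fit on and
# emits sparse count vectors that can be fed to a clustering algorithm.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=1000)
cv.fit(words).transform(words).select("features").show(truncate=False)
```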