
I'd like to turn tweets into vectors for machine learning, so that I can categorize them by content using Spark's K-Means clustering. For example, all tweets relating to Amazon would end up in one category.

I have tried splitting the tweet into words and creating a vector using HashingTF, which wasn't very successful.
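For reference, here's roughly what that attempt looked like (a simplified sketch; the tweets DataFrame and its text column are assumptions):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split each tweet into lowercase words
val tokenizer = new Tokenizer()
 .setInputCol("text")
 .setOutputCol("words")

// Hash the words into a fixed-size term-frequency vector
val hashingTF = new HashingTF()
 .setInputCol("words")
 .setOutputCol("features")
 .setNumFeatures(1000)

// tweets is assumed to be a DataFrame with a "text" column
val featurized = hashingTF.transform(tokenizer.transform(tweets))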

Are there any other ways to vectorize tweets?

Pierre Gourseaud
ethereumbrella

1 Answer

You can try this pipeline:

First, tokenize the input tweet (located in the text column). Basically, this creates a new column, rawWords, containing the list of words taken from the original text. To get these words, it matches runs of word characters rather than splitting on a delimiter (.setPattern("\\w+") combined with .setGaps(false)).

import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
 .setInputCol("text")
 .setOutputCol("rawWords")
 .setPattern("\\w+")
 .setGaps(false)

Secondly, you may consider removing the stop words, i.e. less significant words in the text such as a, the, of, etc.

import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
 .setInputCol("rawWords")
 .setOutputCol("words")

Now it's time to vectorize the words column. In this example I'm using the CountVectorizer, which is quite basic. There are many others, such as a TF-IDF vectorizer (a sketch follows the CountVectorizer example below). You can find more information in the Spark ML feature extraction documentation.

I've configured the CountVectorizer so that it creates a vocabulary of 10,000 words, keeping only words that appear in at least 5 documents (minDF) and counting a word within a document only if it appears at least once there (minTF).

import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
 .setInputCol("words")
 .setOutputCol("features")
 .setVocabSize(10000)
 .setMinDF(5.0)
 .setMinTF(1.0)
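If you prefer the TF-IDF route, a minimal sketch would chain HashingTF with IDF (the 10,000 feature size here is just an illustrative choice):

import org.apache.spark.ml.feature.{HashingTF, IDF}

// Hash words into raw term-frequency vectors
val hashingTF = new HashingTF()
 .setInputCol("words")
 .setOutputCol("rawFeatures")
 .setNumFeatures(10000)

// Rescale term frequencies by inverse document frequency
val idf = new IDF()
 .setInputCol("rawFeatures")
 .setOutputCol("features")

These two stages could replace countVectorizer in the pipeline below.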

Finally, just create the pipeline, fit it on the training dataset, and use the resulting model to transform your data.

import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
 .setStages(Array(
   tokenizer,
   stopWordsRemover,
   countVectorizer))

val pipelineModel = transformPipeline.fit(training)
val vectorized = pipelineModel.transform(test)
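Since your end goal is K-Means, the features column produced above can be fed straight into Spark ML's KMeans. A minimal sketch (k = 10 is just an illustrative value):

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
 .setK(10)
 .setFeaturesCol("features")
 .setPredictionCol("cluster")

// Fit on the vectorized tweets and assign each tweet to a cluster
val clustered = kmeans.fit(vectorized).transform(vectorized)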

Hope it helps.

Álvaro Valencia
  • Thank you for your help! I hadn't considered CountVectorizer, seems useful. I decided to go the Streaming route - I'll be streaming batches of tweets and using the StreamingKMeans model in Spark. Will it be possible to update the CountVectorizer model in real time as new data comes in? – ethereumbrella Jul 30 '18 at 12:22
  • I'm not totally used to the streaming API, but I think you can update the model by calling the `fit` method of the pipeline; that will generate a new model, but you have to pass all the tweets you've processed up to that moment. – Álvaro Valencia Jul 30 '18 at 12:30
  • Check it out: https://stackoverflow.com/questions/48749717/ml-model-update-in-spark-streaming – Álvaro Valencia Jul 30 '18 at 12:32
  • Thanks for the link - I've checked all the content there and wasn't sure which direction to take. I came across this https://stackoverflow.com/questions/40996430/how-to-use-feature-extraction-with-dstream-in-apache-spark . Do you think this method might work? – ethereumbrella Jul 30 '18 at 14:06
  • I'd suggest you open a new question on that. Unfortunately, I can't currently offer much on the Streaming API since I'm still not used to it. Regarding this question, the only thing I can suggest is to use the pipeline I've shown you, and then find a way to optimally refresh the pipeline model (you can save it with the `write` method and train it again with the `fit` method). – Álvaro Valencia Jul 31 '18 at 11:41