You can try this pipeline:
First, tokenize the input tweet (located in the text column). Basically, this creates a new column rawWords holding the list of words extracted from the original text. To get these words, it matches alphanumeric tokens rather than splitting on gaps (.setPattern("\\w+") together with .setGaps(false)):
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("rawWords")
  .setPattern("\\w+")
  .setGaps(false)
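As a quick sanity check (a minimal sketch, assuming an active SparkSession named spark; the sample tweets are made up), you can see what the tokenizer produces:

// Hypothetical sample data: two made-up tweets
val sample = spark.createDataFrame(Seq(
  (0, "Spark is great for NLP"),
  (1, "tokenize ALL the tweets")
)).toDF("id", "text")

// rawWords holds lowercase alphanumeric tokens, e.g. [spark, is, great, for, nlp]
tokenizer.transform(sample).select("rawWords").show(false)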
Second, you may want to remove the stop words, i.e. less significant words such as a, the, of, etc.:
val stopWordsRemover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")
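StopWordsRemover uses an English stop-word list by default. If you need to extend it (tweets often carry fillers like rt), here is a sketch; the extra words are just illustrative:

// Start from the built-in English list and append corpus-specific fillers
// ("rt" and "amp" are hypothetical examples; adjust for your data)
val customRemover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english") ++ Array("rt", "amp"))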
Now it's time to vectorize the words column. In this example I'm using the CountVectorizer, which is quite basic; there are other options, such as TF-IDF weighting (HashingTF or CountVectorizer followed by IDF in Spark ML). You can find more information in the Spark ML feature extraction documentation.
I've configured the CountVectorizer so that it builds a vocabulary of at most 10,000 words, where a word must appear in at least 5 documents to be included (minDF) and at least once within a given document to be counted there (minTF):
val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5.0)
  .setMinTF(1.0)
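If you want to check what vocabulary those thresholds produce, you can fit the vectorizer on its own and inspect it (a sketch; wordsDF is a stand-in for any DataFrame that already has the words column, e.g. the output of the stop-words remover):

import org.apache.spark.ml.feature.CountVectorizerModel

// Fitting yields a CountVectorizerModel whose vocabulary is an Array[String]
val cvModel: CountVectorizerModel = countVectorizer.fit(wordsDF)
println(cvModel.vocabulary.take(20).mkString(", "))  // first 20 vocabulary terms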
Finally, just create the pipeline, fit it on the training data, and transform the dataset with the resulting model:
import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
  .setStages(Array(
    tokenizer,
    stopWordsRemover,
    countVectorizer))

transformPipeline.fit(training).transform(test)
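Note that fit returns a PipelineModel, which you can keep and reuse, so the vocabulary is learned from the training set only and then applied consistently to both sets:

import org.apache.spark.ml.PipelineModel

// Fit once on the training data, then reuse the fitted model
val model: PipelineModel = transformPipeline.fit(training)
val trainFeatures = model.transform(training)
val testFeatures = model.transform(test)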
Hope it helps.