Similarly to scikit-learn's TfidfVectorizer(min_df=20, max_df=0.5, ngram_range=(1,3)), I want to calculate the term frequencies of my text data, considering uni-grams, bi-grams, and tri-grams.
Since I'm new to pyspark, I'm not entirely sure this solution does that, but here is what I have now, which produces a vector with the combined TF-IDFs of each n-gram:
from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, CountVectorizer, IDF, VectorAssembler

def build_trigrams(inputCol="filtered", n=3):
    # One NGram stage per n (unigrams, bigrams, trigrams)
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]
    # One CountVectorizer (term frequencies) per n-gram column
    cv = [
        CountVectorizer(minDF=20, maxDF=0.5, inputCol="{0}_grams".format(i),
                        outputCol="{0}_tf".format(i))
        for i in range(1, n + 1)
    ]
    # IDF weighting for each TF vector
    idf = [
        IDF(inputCol="{0}_tf".format(i), outputCol="{0}_tfidf".format(i),
            minDocFreq=5)
        for i in range(1, n + 1)
    ]
    # Concatenate the per-n TF-IDF vectors into a single features vector
    assembler = [VectorAssembler(
        inputCols=["{0}_tfidf".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]
    return Pipeline(stages=ngrams + cv + idf + assembler)
Now, similarly to what happens in this question, I want to see the features in a dataframe, the same way I would with:

features = tfidf.fit_transform(data['description'])
data_TF_IDF = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names())

so that I can see the TF-IDF of the n-grams from the text data. The problem is that I don't know how to do this when multiple CountVectorizers have to be combined, as in the function above.