Similarly to scikit-learn's TfidfVectorizer(min_df=20, max_df=0.5, ngram_range=(1,3)), I want to calculate the term frequencies of my text data, considering uni-grams, bi-grams, and tri-grams.
Since I'm new to pyspark, I'm not entirely sure this solution does that, but here is what I have now, which produces a vector with the combined TF-IDFs of each n-gram:
from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, CountVectorizer, IDF, VectorAssembler

def build_trigrams(inputCol="filtered", n=3):
    # One NGram stage per n (unigrams, bigrams, trigrams)
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]
    # One CountVectorizer (term frequencies) per n-gram column
    cv = [
        CountVectorizer(minDF=20, maxDF=0.5, inputCol="{0}_grams".format(i),
                        outputCol="{0}_tf".format(i))
        for i in range(1, n + 1)
    ]
    # IDF weighting for each TF vector
    idf = [
        IDF(inputCol="{0}_tf".format(i), outputCol="{0}_tfidf".format(i),
            minDocFreq=5)
        for i in range(1, n + 1)
    ]
    # Concatenate the per-n TF-IDF vectors into a single features vector
    assembler = [VectorAssembler(
        inputCols=["{0}_tfidf".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]
    return Pipeline(stages=ngrams + cv + idf + assembler)
Now, similarly to what happens in this question, I want to see the features in a dataframe, the same way I would with:

features = tfidf.fit_transform(data['description'])
data_TF_IDF = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names())

so that I can see the TF-IDF of the n-grams from the text data. The problem is that I don't know how to do this when multiple CountVectorizers have to be combined, as in the function above.