I have a dataset that consists of 10M rows:
>>> df
             name                                   job                        company
0   Amanda Arroyo                             Herbalist                Norton-Castillo
1  Victoria Brown  Outdoor activities/education manager                  Bowman-Jensen
2       Amy Henry                   Chemist, analytical  Wilkerson, Guerrero and Mason
I want to compute character-level 3-gram TF-IDF vectors for the name column, which is straightforward with sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
X = tfidf.fit_transform(df['name'])
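For clarity, this is what the char analyzer tokenizes — a minimal plain-Python sketch of character-level 3-grams over the raw string, spaces included (note that sklearn additionally lowercases by default, which this sketch skips):

```python
def char_ngrams(text, n=3):
    # Slide a window of width n over the string; spaces count as characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Amy Henry"))
# ['Amy', 'my ', 'y H', ' He', 'Hen', 'enr', 'nry']
```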
The problem is that I can't find any mention of character-level n-grams in the Spark documentation or in the HashingTF API docs.
Is this achievable at all with PySpark?