
I have a dataset that consists of 10M rows:

>>> df

    name            job                                     company
0   Amanda Arroyo   Herbalist                               Norton-Castillo
1   Victoria Brown  Outdoor activities/education manager    Bowman-Jensen
2   Amy Henry       Chemist, analytical                     Wilkerson, Guerrero and Mason

And I want to calculate the 3-gram character-level tfidf vectors for the `name` column, as I would easily do with sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
X = tfidf.fit_transform(df['name'])

The problem is that I can't see any reference to it in the Spark documentation or in the HashingTF API docs.

Is this achievable at all with PySpark?


1 Answer


Yes, it is achievable. The tools are available: Spark's TF-IDF (HashingTF and IDF) and NGram, counterparts to what sklearn provides.

Example of characters being tokenized:

from pyspark.ml.feature import Tokenizer

df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).head()
# Row(text='a b c', words=['a', 'b', 'c'])
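
The same pieces can then be chained onto those tokens. A minimal word-level sketch of TF-IDF over 3-grams, assuming the `spark`, `df`, and `tokenizer` from the snippet above (the `numFeatures` value here is an arbitrary choice):

from pyspark.ml.feature import NGram, HashingTF, IDF

# 3-grams over the token list, then hashed term frequencies and IDF weighting
ngram = NGram(n=3, inputCol="words", outputCol="ngrams")
hashing_tf = HashingTF(inputCol="ngrams", outputCol="tf", numFeatures=1 << 18)
idf = IDF(inputCol="tf", outputCol="tfidf")

ngrams_df = ngram.transform(tokenizer.transform(df))
tf_df = hashing_tf.transform(ngrams_df)
tfidf_df = idf.fit(tf_df).transform(tf_df)

On the toy 'a b c' row this produces a single word-level 3-gram, 'a b c'.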
  • I cannot see in any of the examples how to calculate tfidf on a character-level for the string columns – Iván Sánchez Apr 28 '22 at 12:43
  • Updated my solution to include more specific examples – Matt Andruff Apr 28 '22 at 12:57
  • Thank you for the specific example. However, I must insist that these examples focus on word-level tfidf, as the tokenization is being done at the word level. In the Spark documentation, they use oversimplified string examples of `'a b c'`, which result in the tokens `'a', 'b', 'c'` after tokenizing, but I'm asking how to do a char-level tokenization that, given the string `'a b c'`, would return the char 3-grams `('a b', ' b ', 'b c')` – Iván Sánchez Apr 28 '22 at 13:03
  • Have you considered that a character is just a string of length 1? Spaces are used for tokenization (so you likely need to replace all spaces in your string with another character). Take any word you want, "abc", and it can be tokenized into the words 'a', 'b', 'c' by using 'a b c'. The point I was trying to make is that it's possible; there are extra steps of processing to manipulate longer strings into shorter 'words' by inserting spaces to help you tokenize. It's not overly complicated, but it does require more work. Does that make sense? – Matt Andruff Apr 28 '22 at 13:19
  • You're right: as long as I use a `udf` that tokenizes the strings into lists of characters, I can then apply NGram(3) and HashingTF to the NGram results. I was looking for a simpler solution, but I guess that'll work – Iván Sánchez Apr 28 '22 at 13:21
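
For reference, here is a minimal sketch of the approach described in the last comment; it is not taken from the answer itself, and the `to_chars` helper is a hypothetical name. It assumes the question's `df` with its `name` column and an active `spark` session. Note that NGram joins tokens with spaces, so the character 3-gram 'Ama' comes out as 'A m a', which is harmless for hashing but cosmetically different from sklearn's output.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import NGram, HashingTF, IDF

# Hypothetical helper: split each name into a list of single characters
# (spaces included), mirroring sklearn's analyzer='char'
to_chars = F.udf(lambda s: list(s), ArrayType(StringType()))

chars_df = df.withColumn("chars", to_chars("name"))

# Character-level 3-grams, hashed term frequencies, and IDF weighting
ngram = NGram(n=3, inputCol="chars", outputCol="char_3grams")
hashing_tf = HashingTF(inputCol="char_3grams", outputCol="tf", numFeatures=1 << 18)
idf = IDF(inputCol="tf", outputCol="tfidf")

grams_df = ngram.transform(chars_df)
tf_df = hashing_tf.transform(grams_df)
tfidf_df = idf.fit(tf_df).transform(tf_df)
tfidf_df.select("name", "tfidf").show(3, truncate=False)

Unlike sklearn's TfidfVectorizer, HashingTF keeps no vocabulary; CountVectorizer can be swapped in if the mapping from n-grams to vector indices needs to be recoverable.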