from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
        (0, "Hi I heard about Spark"),
        (1, "I wish Java could use case classes"),
        (2, "Logistic,regression,models,are,neat")
    ], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)

If I run the command

tokenized.head()

I would like to get a result like this:

Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])

However, the result is currently:

Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])

Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?

A similar question is here: Create a custom Transformer in PySpark ML


1 Answer


Have a look at the pyspark.ml documentation. Tokenizer only splits on whitespace, but RegexTokenizer, as the name suggests, uses a regular expression to find either the split points or the tokens to be extracted (this is configured by the gaps parameter).
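For example, here is a minimal sketch of the two modes, assuming the sentenceDataFrame from the question (the pattern values are only illustrative):

from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): the pattern matches the separators between tokens,
# so this splits the third sentence on commas
split_on_commas = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=",", gaps=True)

# gaps=False: the pattern matches the tokens themselves,
# so this extracts runs of word characters
match_words = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=r"\w+", gaps=False)

split_on_commas.transform(sentenceDataFrame).head()
match_words.transform(sentenceDataFrame).head()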

If you pass an empty pattern and leave gaps=True (the default), you should get your desired result:

from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True splits between every character
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
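Note that RegexTokenizer, like Tokenizer, lowercases its input by default (toLowercase=True), so tokenized.head() should yield something like:

Row(id=0, sentence='Hi I heard about Spark',
    words=['h', 'i', ' ', 'h', 'e', 'a', ...])

Pass toLowercase=False to the constructor if you want to keep the original case.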