from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
        (0, "Hi I heard about Spark"),
        (1, "I wish Java could use case classes"),
        (2, "Logistic,regression,models,are,neat")
    ], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)

If I run the command

tokenized.head()

I would like to get a result like this:

Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])

However, the result is currently:

Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])

Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?

A similar question is here: Create a custom Transformer in PySpark ML


1 Answer


Have a look at the pyspark.ml documentation. Tokenizer only splits on whitespace, but RegexTokenizer, as the name suggests, uses a regular expression to find either the split points or the tokens to be extracted (this is configured by the gaps parameter).
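For example, here is a minimal sketch of the two modes, assuming the sentenceDataFrame from the question (the pattern values are only illustrative):

from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): the pattern matches the separators between tokens,
# so this splits the third sentence on commas
split_on_commas = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=",", gaps=True)

# gaps=False: the pattern matches the tokens themselves,
# so this extracts runs of word characters
match_words = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=r"\w+", gaps=False)

split_on_commas.transform(sentenceDataFrame).head()
match_words.transform(sentenceDataFrame).head()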

If you pass an empty pattern and leave gaps=True (the default), you should get your desired result:

from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True splits between every character
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
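Note that RegexTokenizer, like Tokenizer, lowercases its input by default (toLowercase=True), so tokenized.head() should yield something like:

Row(id=0, sentence='Hi I heard about Spark',
    words=['h', 'i', ' ', 'h', 'e', 'a', ...])

Pass toLowercase=False to the constructor if you want to keep the original case.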