from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
If I run the command
tokenized.head()
I would like to get a result like this:
Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])
However, the result I currently get is:
Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])
Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?
A similar question is here: Create a custom Transformer in PySpark ML
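
Update: here is a sketch of what might work, assuming RegexTokenizer can be told to match tokens rather than gaps. With gaps=False the pattern "." should emit every single character (including spaces) as its own token, and toLowercase=False should keep the original casing; I haven't verified that this is the idiomatic approach.

from pyspark.ml.feature import RegexTokenizer

charTokenizer = RegexTokenizer(
    inputCol="sentence",
    outputCol="words",
    pattern=".",        # match any single character (except newline) as a token
    gaps=False,         # pattern describes the tokens, not the separators
    toLowercase=False,  # preserve the original casing
)
charTokenized = charTokenizer.transform(sentenceDataFrame)
charTokenized.head()
# Row(id=0, sentence='Hi I heard about Spark',
#     words=['H', 'i', ' ', 'h', 'e', 'a', ...])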