
I have a Spark DataFrame, and I want to process each sentence of each row (lowercase it, remove punctuation).

To be more specific:

+--------------------------------+
|text                            |
+--------------------------------+
|This is a text.I want to split! |
+--------------------------------+

And I want to get this dataframe:

+---------------------------------+
|text                             |
+---------------------------------+
|[this is text][i want to split]  |
+---------------------------------+
DYZ
  • Do you have example code of your attempt? – gyx-hh Mar 18 '18 at 21:52
  • `SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('Text'))); SplitSentences = SplitSentences.select(SplitSentences.split_sent)` where `sentencesplit_udf = udf(lambda x: splitSentences(x))`. I have also created a function `def clean(text)`; if I create a UDF of it and try to combine it with the split UDF, it doesn't work. –  Mar 18 '18 at 21:55
  • With the `clean` function I remove stopwords, lowercase, etc. So I want to "clean" each split sentence. –  Mar 18 '18 at 22:01
  • Have you seen this solution on SO already? [link](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) – gyx-hh Mar 18 '18 at 22:02
  • Also next time it’s best to post the example code in the question not as a comment as it’s difficult to read. – gyx-hh Mar 18 '18 at 22:12
  • Yes you are right about that. And I apologize for this. Well, I haven't seen this and it worked. Thank you very much –  Mar 18 '18 at 22:16
  • No problem. Glad it worked :) – gyx-hh Mar 18 '18 at 22:18
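The pattern discussed in the comments, combining the split and clean steps into a single UDF body instead of chaining two UDFs, can be sketched in plain Python. Function names here are assumptions mirroring the `splitSentences`/`clean` helpers mentioned above, and this `clean` only lowercases and strips punctuation (the question's stated goal); the OP's version also removes stopwords, which is omitted here.

```python
import re
import string

def split_sentences(text):
    """Split a text into sentences on ., ! and ?, dropping empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def clean(sentence):
    """Lowercase a sentence and strip punctuation."""
    lowered = sentence.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation)).strip()

def split_and_clean(text):
    """Combine both steps so a single UDF can do the whole job."""
    return [clean(s) for s in split_sentences(text)]

# In Spark, register the combined function once as an array-returning UDF
# and apply it to the text column (assuming pyspark is available):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import ArrayType, StringType
#   split_and_clean_udf = udf(split_and_clean, ArrayType(StringType()))
#   df = df.withColumn("text", split_and_clean_udf(col("text")))
```

Doing the split and the clean inside one UDF avoids the problem of composing two separate UDFs, since the intermediate list of sentences never has to round-trip through a Spark column type.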

0 Answers