
I have a Spark DataFrame, and I want to process each sentence of each row (lowercase it, remove punctuation).

To be more specific:

+--------------------------------+
|text                            |
+--------------------------------+
|This is a text.I want to split! |
+--------------------------------+

And I want to get this dataframe:

+---------------------------------+
|text                             |
+---------------------------------+
|[this is text][i want to split]  |
+---------------------------------+
DYZ
  • Do you have example code of your attempt? – gyx-hh Mar 18 '18 at 21:52
  • `SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('Text'))); SplitSentences = SplitSentences.select(SplitSentences.split_sent)` where `sentencesplit_udf = udf(lambda x: splitSentences(x))`. I have also created a function `def clean(text)`; if I create a UDF of it and try to combine it with the split UDF, it doesn't work. –  Mar 18 '18 at 21:55
  • With the `clean` function I remove stopwords, lowercase, etc. So I want to "clean" each split sentence. –  Mar 18 '18 at 22:01
  • Have you seen this solution on SO already? [link](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) – gyx-hh Mar 18 '18 at 22:02
  • Also next time it’s best to post the example code in the question not as a comment as it’s difficult to read. – gyx-hh Mar 18 '18 at 22:12
  • Yes you are right about that. And I apologize for this. Well, I haven't seen this and it worked. Thank you very much –  Mar 18 '18 at 22:16
  • No problem. Glad it worked :) – gyx-hh Mar 18 '18 at 22:18
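The pattern discussed in the comments, combining the split and clean steps into a single UDF body instead of chaining two UDFs, can be sketched in plain Python. Function names here are assumptions mirroring the `splitSentences`/`clean` helpers mentioned above, and this `clean` only lowercases and strips punctuation (the question's stated goal); the OP's version also removes stopwords, which is omitted here.

```python
import re
import string

def split_sentences(text):
    """Split a text into sentences on ., ! and ?, dropping empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def clean(sentence):
    """Lowercase a sentence and strip punctuation."""
    lowered = sentence.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation)).strip()

def split_and_clean(text):
    """Combine both steps so a single UDF can do the whole job."""
    return [clean(s) for s in split_sentences(text)]

# In Spark, register the combined function once as an array-returning UDF
# and apply it to the text column (assuming pyspark is available):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import ArrayType, StringType
#   split_and_clean_udf = udf(split_and_clean, ArrayType(StringType()))
#   df = df.withColumn("text", split_and_clean_udf(col("text")))
```

Doing the split and the clean inside one UDF avoids the problem of composing two separate UDFs, since the intermediate list of sentences never has to round-trip through a Spark column type.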

0 Answers