
I have a Spark data frame as below and would like to split the text column into three by space.

+------------+
|        text|
+------------+
|  aaa bb ccc|
|  aaa bb c d|
|        aa b|
+------------+

Below is the expected outcome. The first item stays in the text1 column, the second item goes to text2, and the rest, if any, all go to text3. The original column can contain null records, or values with any number of occurrences of the splitter, which is the space " ".

+------------+-----+-----+-----+
|        text|text1|text2|text3|
+------------+-----+-----+-----+
|  aaa bb ccc| aaa | bb  | ccc |
|  aaa bb c d| aaa | bb  | c d |
|        aa b| aa  | b   | null|
|        aa  | aa  |null | null|
|            | null|null | null|
+------------+-----+-----+-----+

Thanks in advance!

    Does this answer your question? [Split Spark Dataframe string column into multiple columns](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) – blackbishop Nov 14 '21 at 10:34
  • Thanks. The solution suggesting passing the limit argument should work. It's the same as what the answer below suggests. However, I am getting an error saying split can take only 2 arguments when I pass the 3rd argument to indicate the limit. – MAMS Nov 14 '21 at 15:50

1 Answer


You can use the `split` function with its optional limit argument:

from pyspark.sql import functions as F

# Split on a single space, keeping at most 3 pieces; the i-th piece
# (null when missing) becomes column text{i+1}.
arr_cols = [F.split('text', ' ', 3)[i].alias('text' + str(i + 1)) for i in range(3)]
df = df.select('text', *arr_cols)
df.show(truncate=False)
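
Applied to the sample data above (with the blank row stored as a real null), this should produce something like:

+----------+-----+-----+-----+
|text      |text1|text2|text3|
+----------+-----+-----+-----+
|aaa bb ccc|aaa  |bb   |ccc  |
|aaa bb c d|aaa  |bb   |c d  |
|aa b      |aa   |b    |null |
|aa        |aa   |null |null |
|null      |null |null |null |
+----------+-----+-----+-----+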
过过招
  • Thanks. This should work. However, I am getting the error below complaining that split can take only 2 arguments when I pass the 3rd one, which is the limit argument, "3". Is this because of the PySpark version? arr_cols = [F.split('text', ' ', 3)[i].alias('text' + str(i+1)) for i in range(3)] TypeError: split() takes 2 positional arguments but 3 were given – MAMS Nov 14 '21 at 15:46
  • Yes. Changed in version 3.0: `split` now takes an optional limit argument. If not provided, the default limit value is -1. – 过过招 Nov 15 '21 at 01:04
  • Just checked that my Spark version is 2.3.0.2.6.5.65-2. Is there a way to pass the limit argument in the split function, or to use other functions to achieve the same objective? – MAMS Nov 15 '21 at 02:26
  • In my limited experience with version 2.3, I am afraid that there is no ready-made function to use, and it needs to be implemented through a `UDF`. – 过过招 Nov 15 '21 at 02:45
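
For anyone stuck on Spark < 3.0, here is a minimal sketch of the UDF approach mentioned in the last comment. The helper name `split_limit` and the padding logic are my own, not from the answer: Python's `str.split` already accepts a maxsplit argument, so the UDF simply wraps it and pads the result out to three elements.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Hypothetical helper for Spark < 3.0: split into at most 3 pieces,
# padding with nulls so every row yields a 3-element array.
@F.udf(ArrayType(StringType()))
def split_limit(text):
    parts = text.split(' ', 2) if text else []  # maxsplit=2 -> at most 3 pieces
    return parts + [None] * (3 - len(parts))

arr_cols = [split_limit('text')[i].alias('text' + str(i + 1)) for i in range(3)]
df = df.select('text', *arr_cols)
df.show(truncate=False)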