4

I have a PySpark dataframe with a column that contains comma separated values. The number of values that the column contains is fixed (say 4). Example:

+----+----------------------+
|col1|                  col2|
+----+----------------------+
|   1|val1, val2, val3, val4|
|   2|val1, val2, val3, val4|
|   3|val1, val2, val3, val4|
|   4|val1, val2, val3, val4|
+----+----------------------+

Here I want to split col2 into 4 separate columns as shown below:

+----+-------+-------+-------+-------+
|col1|  col21|  col22|  col23|  col24|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+

How can this be done?

Pierre Gourseaud
  • 2,347
  • 13
  • 24
exAres
  • 4,806
  • 16
  • 53
  • 95
  • 4
    Possible duplicate of [Split Spark Dataframe string column into multiple columns](https://stackoverflow.com/questions/39235704/split-spark-dataframe-string-column-into-multiple-columns) – Florian Aug 03 '18 at 11:44
  • 1
    I posted [an answer](https://stackoverflow.com/a/51680292/5858851) on the linked duplicate that shows how to do this for the general case without using a `udf` or `collect`. – pault Aug 03 '18 at 21:31

1 Answers1

13

I would split the column and make each element of the array a new column.

from pyspark.sql import functions as F

df = spark.createDataFrame(sc.parallelize([['1', 'val1, val2, val3, val4'], ['2', 'val1, val2, val3, val4'], ['3', 'val1, val2, val3, val4'], ['4', 'val1, val2, val3, val4']]), ["col1", "col2"])

df2 = df.select('col1', F.split('col2', ', ').alias('col2'))

# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]

df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
>>>
+----+-------+-------+-------+-------+
|col1|col2[0]|col2[1]|col2[2]|col2[3]|
+----+-------+-------+-------+-------+
|   1|   val1|   val2|   val3|   val4|
|   2|   val1|   val2|   val3|   val4|
|   3|   val1|   val2|   val3|   val4|
|   4|   val1|   val2|   val3|   val4|
+----+-------+-------+-------+-------+
Pierre Gourseaud
  • 2,347
  • 13
  • 24