
I've got a DataFrame whose "b" column contains a pipe-delimited string with a pattern like 'a|b|c|...|z':

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

+---+-----------+------------+-----+
|  a|          b|           c|    d|
+---+-----------+------------+-----+
|  1|1|2|3|4|5|6|[11, 22, 33]|[foo]|
+---+-----------+------------+-----+

I would like to turn the "b" column into an array so I can explode it afterwards and do additional processing; it should look like this:

+---+------------------+------------+-----+
|  a|                 b|           c|    d|
+---+------------------+------------+-----+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+

Hope you can help.

cincin21
  • Possible duplicate of https://stackoverflow.com/questions/41283478/split-contents-of-string-column-in-pyspark-dataframe – giser_yugang Jul 08 '19 at 11:41
  • @giser_yugang I've tried "df.withColumn("b", split("b", "|"))", but it's not what I am looking for, as it creates: "[1, |, 2, |, 3, |..." – cincin21 Jul 08 '19 at 11:58
  • @giser_yugang Later I've tried "df.withColumn("b", split("b", "\|"))" and it has worked! Great, thank you! – cincin21 Jul 08 '19 at 12:05

1 Answer


Thanks to @giser_yugang, here is the solution to my question:

from pyspark.sql.functions import split

# split() interprets its pattern as a regex, so the pipe must be escaped;
# a raw string keeps the backslash literal. withColumn() returns a new
# DataFrame, so the result must be assigned.
df = df.withColumn("b", split("b", r"\|"))
cincin21