
I've got a DataFrame whose "b" column contains a pipe-delimited string with a pattern like 'a|b|c|...|z':

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

+---+-----------+------------+-----+
|  a|          b|           c|    d|
+---+-----------+------------+-----+
|  1|1|2|3|4|5|6|[11, 22, 33]|[foo]|
+---+-----------+------------+-----+

I would like to turn the "b" column into an array so I can explode it afterwards and do additional processing; it should look like this:

+---+------------------+------------+-----+
|  a|                 b|           c|    d|
+---+------------------+------------+-----+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+

Hope you can help.

cincin21
  • Possible duplicate of https://stackoverflow.com/questions/41283478/split-contents-of-string-column-in-pyspark-dataframe – giser_yugang Jul 08 '19 at 11:41
  • @giser_yugang I've tried "df.withColumn("b", split("b", "|"))", but it's not what I am looking for, as it creates: "[1, |, 2, |, 3, |..." – cincin21 Jul 08 '19 at 11:58
  • @giser_yugang Later I've tried "df.withColumn("b", split("b", "\|"))" and it has worked! Great, thank you! – cincin21 Jul 08 '19 at 12:05

1 Answer


Thanks to @giser_yugang, here is the solution to my question:

from pyspark.sql.functions import split

# split() interprets its pattern as a regex, so the pipe must be escaped;
# a raw string keeps the backslash literal. withColumn() returns a new
# DataFrame, so the result must be assigned.
df = df.withColumn("b", split("b", r"\|"))
cincin21