
I have a dataframe (with more rows and columns) as shown below.

Sample DF:

from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import explode

sqlc = SQLContext(sc)  # sc is the active SparkContext (available by default in the pyspark shell)

df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# |  col1|         col2|  col3|
# +------+-------------+------+
# |    z1| [a1, b2, c3]|   foo|
# +------+-------------+------+

df
# DataFrame[col1: string, col2: string, col3: string]

What I want:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  foo|
|   z1|   b2|  foo|
|   z1|   c3|  foo|
+-----+-----+-----+

I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows

(df
    .rdd
    .flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2])
    .toDF(["col1", "col2", "col3"]))

However, it is not giving the required result: since col2 holds a plain string, iterating over row.col2 yields its individual characters rather than the array elements.
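Demonstration of the wrong output (assuming the sample df above; output roughly as shown):

(df
    .rdd
    .flatMap(lambda row: [(row.col1, c, row.col3) for c in row.col2])
    .toDF(["col1", "col2", "col3"])
    .show(3))
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  z1|   [| foo|
# |  z1|   a| foo|
# |  z1|   1| foo|
# +----+----+----+
# only showing top 3 rows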

Edit: The explode option does not work because col2 is currently stored as a string and the explode function expects an array.
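For reference, a minimal reproduction on the sample df above (the exact exception message may vary by Spark version):

df.select(explode("col2")).show()
# AnalysisException: cannot resolve 'explode(`col2`)' due to data type mismatch:
# input to function explode should be array or map type, not string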

SmitM
  • You're looking for `explode` after converting the string to an array, which can be done with `split` and `regexp_replace`. – pault Jul 17 '19 at 02:27
  • Possible duplicate of [Explode array data into rows in spark](https://stackoverflow.com/questions/44436856/explode-array-data-into-rows-in-spark) – pault Jul 17 '19 at 02:28
  • Example of how to convert the string to an array [here](https://stackoverflow.com/questions/49681837/convert-stringtype-to-arraytype-in-pyspark) – pault Jul 17 '19 at 02:32
  • @pault - `explode` will not work since all my columns are stored as string – SmitM Jul 17 '19 at 15:54

2 Answers


You can use explode but first you'll have to convert the string representation of the array into an array.

One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2", 
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|  z1|  a1| foo|
#|  z1|  b2| foo|
#|  z1|  c3| foo|
#+----+----+----+
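As a side note, the inner split(...) already yields an ArrayType column, so the string-to-array conversion can be inspected on its own before exploding (the intermediate column name col2_arr is illustrative; the schema output is roughly what Spark prints):

df.withColumn(
    "col2_arr",
    split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", ")
).printSchema()
# root
#  |-- col1: string (nullable = true)
#  |-- col2: string (nullable = true)
#  |-- col3: string (nullable = true)
#  |-- col2_arr: array (nullable = true)
#  |    |-- element: string (containsNull = true)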
pault

Pault's solution should work perfectly fine, although here is another solution that uses regexp_extract instead (you don't really need to replace anything in this case) and can handle an arbitrary number of spaces:

from pyspark.sql.functions import col, explode, regexp_extract, regexp_replace, split

df.withColumn(
    "col2",
    explode(
        split(
            regexp_extract(regexp_replace(col("col2"), r"\s", ""), r"^\[(.*)\]$", 1),
            ","))
).show()

Explanation:

  • Initially, regexp_replace(col("col2"), "\s", "") replaces all whitespace with the empty string.
  • Next, regexp_extract extracts the content of the column between the leading [ and trailing ].
  • Then we execute split on the comma-separated values and finally explode (see the step-by-step sketch after this list).
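A step-by-step sketch of those three steps on the sample value (the intermediate column names are illustrative only):

from pyspark.sql.functions import col, explode, regexp_extract, regexp_replace, split

(df
    .withColumn("no_spaces", regexp_replace(col("col2"), r"\s", ""))          # '[a1,b2,c3]'
    .withColumn("inner", regexp_extract(col("no_spaces"), r"^\[(.*)\]$", 1))  # 'a1,b2,c3'
    .withColumn("as_array", split(col("inner"), ","))                         # ['a1', 'b2', 'c3']
    .withColumn("col2", explode(col("as_array")))                             # one row per element
    .select("col1", "col2", "col3")
    .show())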
abiratsis
  • *you don't really need to replace anything in this case* but your first bullet is replacing spaces. You should be able to get around this and maintain support for an arbitrary number of spaces by splitting on `", +"`. – pault Jul 17 '19 at 19:27
  • Actually I am removing spaces, pault; what I meant is that even if you can achieve the same with regexp_replace, the goal here is to extract a string, hence regexp_extract can be more suitable. – abiratsis Jul 17 '19 at 21:48
  • Sure, but what if the array was `[a1, an element with spaces, c3]`? Only OP can provide the requirements, which isn't to say your method is not useful; it does work for the given test case after all (+1). – pault Jul 17 '19 at 21:53
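For completeness, a sketch of the variant pault mentions, splitting on the regex ", +" instead of stripping whitespace first, so elements with internal spaces survive (the second sample row is made up for illustration; Row and sqlc come from the question's setup):

from pyspark.sql.functions import col, explode, regexp_replace, split

df2 = sqlc.createDataFrame([Row(col1='z1', col2='[a1, an element with spaces, c3]', col3='foo')])
df2.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", +"))
).show(truncate=False)
# col2 rows: 'a1', 'an element with spaces', 'c3'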