
I have the following PySpark DataFrame:

id   col1   col2
A    2      3
A    2      4
A    4      6
B    1      2

I want to stack col1 and col2 in order to get a single column as follows:

id   col3
A    2   
A    3
A    4
A    6
B    1
B    2

How can I do so?

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)
Fluxy

3 Answers


The simplest approach is to merge col1 and col2 into an array column and then explode it:

df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df.selectExpr('id', 'explode(array(col1, col2))').show()
+---+---+
| id|col|
+---+---+
|  A|  2|
|  A|  3|
|  A|  2|
|  A|  4|
|  A|  4|
|  A|  6|
|  B|  1|
|  B|  2|
+---+---+

You can drop the duplicate rows afterwards if you don't need them.

Psidom

To do this, group by "id", collect the values from both "col1" and "col2" into lists in an aggregation, and then explode the combined list back into a single column. To get unique numbers, drop the duplicates afterwards.

Your desired result also has the numbers sorted; this is done by sorting the concatenated lists in the aggregation.

The following code:

from pyspark.sql.functions import concat, collect_list, explode, col, sort_array

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)

result = (
    df.groupBy("id")
    .agg(sort_array(concat(collect_list("col1"), collect_list("col2"))).alias("all_numbers"))
    .withColumn("number", explode(col("all_numbers")))
    .dropDuplicates()
    .select("id", "number")
    .orderBy("id", "number")
)
result.show()

will yield:

+---+------+
| id|number|
+---+------+
|  A|     2|
|  A|     3|
|  A|     4|
|  A|     6|
|  B|     1|
|  B|     2|
+---+------+
Sander Lam

A rather simple solution if the number of columns involved is small: select each value column alongside "id", rename the columns to match, and union the results.

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)


df.show()

+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df1 = df.select(['id', 'col1'])
df2 = df.select(['id', 'col2']).withColumnRenamed('col2', 'col1')

df_new = df1.union(df2)
df_new = df_new.drop_duplicates()
df_new.show()

+---+----+
| id|col1|
+---+----+
|  A|   3|
|  A|   4|
|  B|   1|
|  A|   6|
|  A|   2|
|  B|   2|
+---+----+
Preetham