PySpark merge 2 column values by index into new list

Question

Let's say I have a dataframe like this:

+--------------------+--------------------+
|              Fruits|               Count|
+--------------------+--------------------+
|[Pear, Orange]      |[1,2]               |
+--------------------+--------------------+
|[Orange, Pear]      |[2,1]               |
+--------------------+--------------------+
|[Orange, Pear]      |[2,1]               | 
+--------------------+--------------------+

I want another column with the merged info:

+--------------------+------------+----------------------------+
|              Fruits|    Count   |     merged                 |
+--------------------+------------+----------------------------+
|[Pear, Orange]      |[1,2]       |[('Pear',1),('Orange',2)]   |
+--------------------+------------+----------------------------+
|[Pear, Orange]      |[2,1]       |[('Pear',2),('Orange',1)]   |
+--------------------+------------+----------------------------+
|[Orange, Pear]      |[2,1]       |[('Pear',1),('Orange',2)]   |
+--------------------+------------+----------------------------+

I included the third row because I'm hoping the merged column can first be built as a list of tuples and then sorted.

Is there a function in PySpark that can do this?

I know we can merge columns as described here: pyspark - merge 2 columns of sets, but I want them merged with a dictionary-style approach (pairing values by index) rather than concatenation.

For Spark 2.4+, use [`arrays_zip`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.arrays_zip) - pault, Mar 04 '20 at 16:35
After using arrays_zip, I tried to sort the tuples, but the result is still not sorted: `df.select(arrays_zip(df.fruits, df.count).alias('zipped')).sort(col("zipped")).show(20,False)` - jxn, Mar 04 '20 at 16:58
For that you may be able to use [`array_sort`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.array_sort) - pault, Mar 04 '20 at 17:00
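
The two suggestions in the comments can be combined roughly as follows. This is only a sketch, assuming Spark 2.4+ and the column names `Fruits` and `Count` from the question; the helper names `merge_and_sort` and `add_merged_column` are made up for illustration. One thing worth noting: `DataFrame.sort` reorders the rows of the dataframe, whereas `array_sort` orders the elements inside each row's array, which may be why the attempt above still looked unsorted. Also, `df.count` refers to the `DataFrame.count` method rather than a column, so the sketch uses `F.col("Count")` instead.

```python
def merge_and_sort(fruits, counts):
    """Pure-Python reference: pair values by index, then sort the pairs."""
    return sorted(zip(fruits, counts))


def add_merged_column(df):
    """Spark 2.4+ sketch: the same idea using built-in column functions.

    Assumes `df` has array columns named "Fruits" and "Count".
    """
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F

    # F.col("Count") avoids df.count, which is the DataFrame.count method.
    zipped = F.arrays_zip(F.col("Fruits"), F.col("Count"))
    # array_sort orders the elements inside each row's array;
    # DataFrame.sort would reorder the rows instead.
    return df.withColumn("merged", F.array_sort(zipped))


# Plain-Python illustration of the intended per-row result.
print(merge_and_sort(["Pear", "Orange"], [1, 2]))
```

With the fruit name as the first element of each pair, sorting orders the pairs by fruit name and then by count. The expected output in the question does not make the sort key fully clear, so if the pairs should be ordered by count instead, one option is to zip with `Count` first, i.e. `F.arrays_zip(F.col("Count"), F.col("Fruits"))`.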
