pyspark - merge 2 columns of sets

Question

I have a spark dataframe that has 2 columns formed from the function collect_set. I would like to combine these 2 columns of sets into 1 column of set. How should I do so? They are both set of strings

For Instance I have 2 columns formed from calling collect_set

Fruits                  |    Meat
[Apple,Orange,Pear]          [Beef, Chicken, Pork]

How do I turn it into:

Food

[Apple,Orange,Pear, Beef, Chicken, Pork]

Thank you very much for your help in advance

Please provide more information like structure of the dataframe with examples — Avishek Bhattacharya, Oct 06 '17 at 16:48

score 9 · Answer 1 · edited Jun 13 '18 at 12:04

I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:

df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1,False)

from pyspark.sql.functions import udf
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Kudos to Ramesh!

EDIT: Note that you might have to manually specify the column type (not sure why it worked for me only in some cases without explicit type specification - in other cases I was getting a string type column).

from pyspark.sql.types import *
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))

score 2 · Answer 2 · edited Jun 13 '18 at 12:05

Given that you have dataframe as

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+

You can write a udf function to merge the sets of two columns into one.

import org.apache.spark.sql.functions._
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)

And then call the udf function as

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)

You should have your desired final dataframe

+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Is this with python? I can't seem to find mutable.WrappedArray — soulless, Oct 08 '17 at 11:41

score 2 · Answer 3 · answered Jul 22 '22 at 05:22

Adding solution here for the definition of a set not containing duplicates. Also avoids any performance concerns with python udfs.

Requires Spark 2.4+

from pyspark.sql import functions as F
df = spark.createDataFrame([(['Chicken','Pork','Beef',"Tuna"], ["Salmon", "Tuna"])],
                           ("Meat", "Fish"))
df.show(1,False)
df_union = df.withColumn("set_union", F.array_distinct(F.array_union("Meat", "Fish")))
df_union.show(1, False)

results in

+---------------------------+--------------+-----------------------------------+
|Meat                       |Fish          |set_union                          |
+---------------------------+--------------+-----------------------------------+
|[Chicken, Pork, Beef, Tuna]|[Salmon, Tuna]|[Chicken, Pork, Beef, Tuna, Salmon]|
+---------------------------+--------------+-----------------------------------+

score 0 · Accepted Answer · edited Jun 13 '18 at 12:04

Let's say df has

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

then

import itertools
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()

creates a set of Fruits & Meat combined into one set i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]

Hope this helps!

pyspark - merge 2 columns of sets

4 Answers4

Linked