I have two dataframes. The data is too large for pandas, so I need to do this in PySpark.
df1:
col 1 | col 2 | col 3 |
12345 | asd   | zxc   |
12345 | qwe   | dfg   |
12345 | ert   | fgh   |
df2:
col 1 | col 2   | col 3 |
54321 | asd     | poi   |
54321 | qwe     | lkj   |
54321 | ert     | mnb   |
54321 | ytuyeye | jfg   |
I want to concatenate these dataframes on the same columns, so the width of the table doesn't expand, and take only the rows in df2 whose col 2 value also appears in df1. In the above example, I would expect the result to have 6 rows.
In other words, df1 is a snapshot and df2 is a snapshot taken at a later date, from which I only want to bring in records that are also present in df1. I've tried join and union but haven't had any luck. Thanks in advance.
I got the answer:
# Join on col2, then keep only df2's columns ('b.*') so the table
# width doesn't expand and the schemas line up for the union.
df2_matches = df2.alias('b').join(df1.alias('a'), ['col2'], "inner")\
    .select("b.*")
df3 = df1.union(df2_matches)
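For reference, here is a minimal self-contained sketch of the whole thing, assuming simplified column names col1/col2/col3 without spaces (as in the code above). It uses a left semi join as an equivalent alternative: a semi join returns only the left side's columns by definition, so the width can't expand and no extra select is needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(12345, "asd", "zxc"), (12345, "qwe", "dfg"), (12345, "ert", "fgh")],
    ["col1", "col2", "col3"],
)
df2 = spark.createDataFrame(
    [(54321, "asd", "poi"), (54321, "qwe", "lkj"),
     (54321, "ert", "mnb"), (54321, "ytuyeye", "jfg")],
    ["col1", "col2", "col3"],
)

# Semi join: keep df2 rows whose col2 appears in df1; only df2's
# columns come back, so the schema already matches df1's for union.
df2_matches = df2.join(df1, df2["col2"] == df1["col2"], "leftsemi")

df3 = df1.union(df2_matches)
df3.show()  # 6 rows: df1's 3 plus the 3 matching df2 rows ("ytuyeye" is dropped)

If the real column names contain spaces, bracket access handles them the same way, e.g. df2["col 2"] == df1["col 2"].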