I have two dataframes. The data is too large for pandas, so I need to do this in PySpark.
df1:
col 1 | col 2 | col 3 |
12345 | asd   | zxc   |
12345 | qwe   | dfg   |
12345 | ert   | fgh   |
df2:
col 1 | col 2   | col 3 |
54321 | asd     | poi   |
54321 | qwe     | lkj   |
54321 | ert     | mnb   |
54321 | ytuyeye | jfg   |
I want to concatenate these dataframes on the same columns, so the width of the table doesn't expand, and take only the rows in df2 whose col 2 value also appears in df1. In the above example, I would expect the result to have 6 rows.
In other words, df1 is a snapshot and df2 is a snapshot taken at a later date, from which I only want to bring in records that are also present in df1. I've tried join and union but haven't had any luck. Thanks in advance.
I got the answer:
# Join on col2, then keep only df2's columns ('b.*') so the table
# width doesn't expand and the schemas line up for the union.
df2_matches = df2.alias('b').join(df1.alias('a'), ['col2'], "inner")\
    .select("b.*")
df3 = df1.union(df2_matches)
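For reference, here is a minimal self-contained sketch of the whole thing, assuming simplified column names col1/col2/col3 without spaces (as in the code above). It uses a left semi join as an equivalent alternative: a semi join returns only the left side's columns by definition, so the width can't expand and no extra select is needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(12345, "asd", "zxc"), (12345, "qwe", "dfg"), (12345, "ert", "fgh")],
    ["col1", "col2", "col3"],
)
df2 = spark.createDataFrame(
    [(54321, "asd", "poi"), (54321, "qwe", "lkj"),
     (54321, "ert", "mnb"), (54321, "ytuyeye", "jfg")],
    ["col1", "col2", "col3"],
)

# Semi join: keep df2 rows whose col2 appears in df1; only df2's
# columns come back, so the schema already matches df1's for union.
df2_matches = df2.join(df1, df2["col2"] == df1["col2"], "leftsemi")

df3 = df1.union(df2_matches)
df3.show()  # 6 rows: df1's 3 plus the 3 matching df2 rows ("ytuyeye" is dropped)

If the real column names contain spaces, bracket access handles them the same way, e.g. df2["col 2"] == df1["col 2"].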