I have two large Spark DataFrames, one with 50k rows and the other with 60k. I am trying to compare the string column of one DataFrame against the same column of the other and generate a new DataFrame with a Remark column. If a string from df1 is also present in df2, the row should get the remark duplicate; rows found only in df1 should be marked old, and rows found only in df2 should be marked new.

df1
colA    colB
A       d4f488bef2d2e25371caecb6a505d69f
B       c8a91953fc52ecdec31ac19c61538aca
C       62026fd921133e434d860591fc03f66a
D       e88480226d3b7e791f6e861c30399fb5
E       8335195031ecfee8f979247c6e7d68cb

df2
ColA    ColB
W       411c78854c9cbcb89a02f53c4b6bca59
X       0bfeb09d6cfb26fc9c618b4cbdfadee6
C       62026fd921133e434d860591fc03f66a
E       8335195031ecfee8f979247c6e7d68cb

Expected output: df3
ColA    ColB                                Remark
A       d4f488bef2d2e25371caecb6a505d69f    old
B       c8a91953fc52ecdec31ac19c61538aca    old
D       e88480226d3b7e791f6e861c30399fb5    old
W       411c78854c9cbcb89a02f53c4b6bca59    new
X       0bfeb09d6cfb26fc9c618b4cbdfadee6    new
C       62026fd921133e434d860591fc03f66a    duplicate
E       8335195031ecfee8f979247c6e7d68cb    duplicate 
