I have two large Spark DataFrames, one with 50k rows and the other with 60k. I want to compare the string values in df1 against those in df2 and build a new DataFrame with a Remark column. A row that appears only in df1 should be marked old, a row that appears only in df2 should be marked new, and a row whose string from df1 is also present in df2 should be marked duplicate.
df1
colA colB
A d4f488bef2d2e25371caecb6a505d69f
B c8a91953fc52ecdec31ac19c61538aca
C 62026fd921133e434d860591fc03f66a
D e88480226d3b7e791f6e861c30399fb5
E 8335195031ecfee8f979247c6e7d68cb
df2
ColA ColB
W 411c78854c9cbcb89a02f53c4b6bca59
X 0bfeb09d6cfb26fc9c618b4cbdfadee6
C 62026fd921133e434d860591fc03f66a
E 8335195031ecfee8f979247c6e7d68cb
Expected output (df3):
ColA ColB Remark
A d4f488bef2d2e25371caecb6a505d69f old
B c8a91953fc52ecdec31ac19c61538aca old
D e88480226d3b7e791f6e861c30399fb5 old
W 411c78854c9cbcb89a02f53c4b6bca59 new
X 0bfeb09d6cfb26fc9c618b4cbdfadee6 new
C 62026fd921133e434d860591fc03f66a duplicate
E 8335195031ecfee8f979247c6e7d68cb duplicate