Scala - Drop records from DF1 if it has matching data with column from DF2

Question

I have two DF's(railroadGreaterFile, railroadInputFile).

I want to drop records from railroadGreaterFile if data in MEMBER_NUM column from railroadGreaterFile is matching the data in MEMBER_NUM column from railroadInputFile

Below is what i used:

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
                               .join(columnrailroadInputFile, Seq("MEMBER_NUM"), "left")
                               .filter($"check".isNull)
                               .drop($"check")

Doing above, records are dropped, however i witnessed railroadGreaterNotInput's schema is combination of my DF1 and DF2 so when I try to write the railroadGreaterNotInput's data to file, it gives me below error

org.apache.spark.sql.AnalysisException: Reference 'GROUP_NUM' is ambiguous, could be: GROUP_NUM#508, GROUP_NUM#72

What should i be doing so that railroadGreaterNotInput would only contain fields from railroadGreaterFile DF?

You can rename the conflicting column names from railroadInputFile and just select railroadGreaterFile dataframe columns only after you join them — Ramesh Maharjan, May 03 '18 at 10:57

koiralo · Accepted Answer · 2018-05-03T11:06:22.500

2

You can only select the MEMBER_NUM while joining

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile.join(
    columnrailroadInputFile.select("MEMBER_NUM", "check"), Seq("MEMBER_NUM"), "left")
   .filter($"check".isNull).drop($"check")

Or drop all the columns from columnrailroadInputFile as

columnrailroadInputFile.drop(columnrailroadInputFile.columns :_*)

but for this use join contition as

columnrailroadInputFile("MEMBER_NUM") === railroadInputFile("MEMBER_NUM")

Hope this helps!

edited May 03 '18 at 11:06

answered May 03 '18 at 10:58

koiralo

22,594
6
51
72

I'm almost there, now i have first column as MEMBER_NUM followed by rest of the columns, is there a way i can swap the first and second columns in railroadGreaterNotInput? – Kiran Kumar May 03 '18 at 11:12
I am not sure, what do you mean by swap the first and second column. – koiralo May 03 '18 at 11:14
Schema of railroadInputFile is GROUP_NUM, MEMBER_NUM, .... Schema of railroadGreaterFile is GROUP_NUM, MEMBER_NUM, .... The final DF railroadGreaterNotInput's Schema shows MEMBER_NUM, GROUP_NUM.... I want the final DF to be in sync with my DF1 and DF2. – Kiran Kumar May 03 '18 at 11:17
Do you wanna swap `GROUP_NUM` and `MEMBER_NUM` and why do you wanna do that ? – koiralo May 03 '18 at 11:19
You just need to use select("fields in order you want ") – koiralo May 03 '18 at 11:20
Oh yeah, selecting the fields in the order i want will be my last take on this, was just curious if we can change the schema of my final DF altogether. – Kiran Kumar May 03 '18 at 11:22
For that Either cast the each column as you require or create an schema that you want to apply and apply it by converting it to rdd. – koiralo May 03 '18 at 11:25

Scala - Drop records from DF1 if it has matching data with column from DF2

1 Answers1