1

I have two DF's(railroadGreaterFile, railroadInputFile).

I want to drop records from railroadGreaterFile if data in MEMBER_NUM column from railroadGreaterFile is matching the data in MEMBER_NUM column from railroadInputFile

Below is what i used:

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
                               .join(columnrailroadInputFile, Seq("MEMBER_NUM"), "left")
                               .filter($"check".isNull)
                               .drop($"check")

Doing above, records are dropped, however i witnessed railroadGreaterNotInput's schema is combination of my DF1 and DF2 so when I try to write the railroadGreaterNotInput's data to file, it gives me below error

org.apache.spark.sql.AnalysisException: Reference 'GROUP_NUM' is ambiguous, could be: GROUP_NUM#508, GROUP_NUM#72

What should i be doing so that railroadGreaterNotInput would only contain fields from railroadGreaterFile DF?

philantrovert
  • 9,904
  • 3
  • 37
  • 61
  • You can rename the conflicting column names from railroadInputFile and just select railroadGreaterFile dataframe columns only after you join them – Ramesh Maharjan May 03 '18 at 10:57

1 Answers1

2

You can only select the MEMBER_NUM while joining

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile.join(
    columnrailroadInputFile.select("MEMBER_NUM", "check"), Seq("MEMBER_NUM"), "left")
   .filter($"check".isNull).drop($"check")

Or drop all the columns from columnrailroadInputFile as

columnrailroadInputFile.drop(columnrailroadInputFile.columns :_*)

but for this use join contition as

columnrailroadInputFile("MEMBER_NUM") === railroadInputFile("MEMBER_NUM")

Hope this helps!

koiralo
  • 22,594
  • 6
  • 51
  • 72
  • I'm almost there, now i have first column as MEMBER_NUM followed by rest of the columns, is there a way i can swap the first and second columns in railroadGreaterNotInput? – Kiran Kumar May 03 '18 at 11:12
  • I am not sure, what do you mean by swap the first and second column. – koiralo May 03 '18 at 11:14
  • Schema of railroadInputFile is GROUP_NUM, MEMBER_NUM, .... Schema of railroadGreaterFile is GROUP_NUM, MEMBER_NUM, .... The final DF railroadGreaterNotInput's Schema shows MEMBER_NUM, GROUP_NUM.... I want the final DF to be in sync with my DF1 and DF2. – Kiran Kumar May 03 '18 at 11:17
  • Do you wanna swap `GROUP_NUM` and `MEMBER_NUM` and why do you wanna do that ? – koiralo May 03 '18 at 11:19
  • You just need to use select("fields in order you want ") – koiralo May 03 '18 at 11:20
  • Oh yeah, selecting the fields in the order i want will be my last take on this, was just curious if we can change the schema of my final DF altogether. – Kiran Kumar May 03 '18 at 11:22
  • For that Either cast the each column as you require or create an schema that you want to apply and apply it by converting it to rdd. – koiralo May 03 '18 at 11:25