
I have a dataframe named "A" with 300+ columns in it, and I am trying to join dataframe "A" with its incremental data "B", which has the same schema as "A".

After joining the dataframes, I am getting duplicate columns, which I have been avoiding with the following code:

// null-preferring picker: keep value1 unless it is null
val toPrint = udf((value1: String, value2: String) => if (value1 != null) value1 else value2)

val dfClean = df1.join(df2, df1("PERIOD_TAG") === df2("PERIOD_TAG"), "fullouter")
  .select(
    toPrint(df1("PERIOD_SHORT_DESCRIPTION"), df2("PERIOD_SHORT_DESCRIPTION")).alias("PERIOD_SHORT_DESCRIPTION"),
    toPrint(df1("PERIOD_TAG"), df2("PERIOD_TAG")).alias("PERIOD_TAG"),
    ...) // and so on for all the columns

I am calling a UDF to select the most up-to-date value (from the incremental file) among the duplicate columns. The incremental data will have a few updated rows, which I need to keep along with all the new rows in the incremental dataframe and also the old rows of dataframe "A".

Is there another way to avoid selecting the columns individually, e.g. using a for loop? Or is there a way to get, after the join, the new/updated values from my incremental dataframe plus the old values of dataframe "A" that are not present in dataframe "B"?

  • It's very similar to the mentioned problem, but it doesn't suggest how to overcome the select problem when the number of columns is large, in my case 300+. I want to give a join condition and sequence too. – Ankan Jun 05 '17 at 15:47
  • This is certainly **not** a duplicate since the number of columns exceeds what one would like to write by hand. What's more important, the case is to pick right or left value when left is null or not, respectively. I'm sure it's more general case than what was given as the duplicate. – Jacek Laskowski Jun 05 '17 at 16:51

1 Answer


I'd first avoid the duplication in join column names by using the usingColumns variant of the join operator (a column-name argument rather than a join expression).

df1.join(df2, Seq("PERIOD_TAG"), "fullouter")

That takes care of de-duplicating PERIOD_TAG column.

Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
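As a toy illustration (the frames and the non-key columns x/y are mine), compare the two join forms side by side:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("dedup-demo").getOrCreate()
import spark.implicits._

val df1 = Seq(("p1", 1)).toDF("PERIOD_TAG", "x")
val df2 = Seq(("p1", 2)).toDF("PERIOD_TAG", "y")

// expression form: both PERIOD_TAG columns survive the join
val exprForm = df1.join(df2, df1("PERIOD_TAG") === df2("PERIOD_TAG"), "fullouter")

// usingColumns form: the key is emitted once, like SQL's JOIN ... USING
val usingForm = df1.join(df2, Seq("PERIOD_TAG"), "fullouter")
```

Here `exprForm.columns` is `PERIOD_TAG, x, PERIOD_TAG, y`, while `usingForm.columns` is `PERIOD_TAG, x, y`.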

The last step is to use coalesce function:

coalesce(e: Column*): Column Returns the first column that is not null, or null if all inputs are null.

That looks like your case exactly and avoids dealing with 300+ columns.

// alias the frames so the qualified names "df1.one"/"df2.one" resolve
val left  = df1.as("df1")
val right = df2.as("df2")
val myCol = coalesce($"df1.one", $"df2.one") as "one"
left.join(right, Seq("PERIOD_TAG"), "inner").
  select(myCol).
  show

So the exercise is to build a myCol-like sequence of columns, using the coalesce function for every column in the schema (which looks like a fairly easy programming assignment :))
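A minimal sketch of that assignment, with toy data and column names of my own choosing; it assumes both frames share the schema, PERIOD_TAG is the join key, and the incremental value should win whenever it is present:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

val spark = SparkSession.builder.master("local[1]").appName("merge-demo").getOrCreate()
import spark.implicits._

// toy stand-ins for "A" (full history) and "B" (incremental)
val dfA = Seq(("p1", "old-1"), ("p2", "old-2")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")
val dfB = Seq(("p2", "new-2"), ("p3", "new-3")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")

val joinKey = "PERIOD_TAG"
// full outer join on the key; the Seq form emits the key column once
val joined = dfA.join(dfB, Seq(joinKey), "fullouter")
// for every non-key column, prefer the incremental value, falling back to the old one
val picked = col(joinKey) +: dfA.columns.filterNot(_ == joinKey)
  .map(c => coalesce(dfB(c), dfA(c)).as(c))
val merged = joined.select(picked: _*)
merged.show()
```

The result keeps p1 with its old value, p2 with the updated value from the incremental frame, and the brand-new p3, without ever spelling out the 300+ columns by hand.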

Jacek Laskowski
  • df1.join(df2, "PERIOD_TAG").select(df1("*")).show will give me all the new rows that conflict with the old ones; even if I union with the old dataframe, I will still have duplicate data in the columns. – Ankan Jun 05 '17 at 16:21
  • In principle, I am joining the whole table along with the duplicate data and then separating the common repeated values. `val dfClean01 = df1.join(df3, Seq("PRDC_KEY"), "left").select(df1("*")); val dfClean02 = df3.join(df2, Seq("PRDC_KEY"), "left").select(df3("*")); val dfClean04 = dfClean01.unionAll(dfClean02) // joins all columns, with duplicate data; val dfClean03 = df2.join(df3, Seq("PRDC_KEY")).select(df2("*")) // separates the duplicate old data; val finalJoin = dfClean04.except(dfClean03) // A-B` – Ankan Jun 06 '17 at 15:39
  • Thanks for the solution. – Ankan Jun 06 '17 at 15:39