
I am using spark-sql 2.4.1v. I need to join two datasets, ds1 and ds2 (which contains "new-column"), on some field:

val resultDs = ds1.join(ds2, <join condition>, "inner")

resultDs now contains "new-column", but the records that did not meet the join condition do not have this "new-column". For those records I need to set "new-column" to null/empty.

But as per my business requirement I need to partition resultDs on "new-column". So how are these kinds of scenarios generally handled? Please advise.
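For context, here is a minimal sketch of the setup described above. The join key `id`, the sample data, and the use of `repartition` (rather than `writer.partitionBy`) are assumptions for illustration only, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("join-partition-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative data: "id" is an assumed join key, "new-column" comes from ds2
val ds1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val ds2 = Seq((1, "x"), (2, "y")).toDF("id", "new-column")

// Inner join: the row with id = 3 has no match in ds2, so it is dropped entirely;
// it does not appear in resultDs with a null "new-column", it is simply gone.
val resultDs = ds1.join(ds2, Seq("id"), "inner")

// Business requirement: partition on "new-column" (shown as repartition;
// writer.partitionBy on output is the other common reading of the requirement)
val partitioned = resultDs.repartition(col("new-column"))
partitioned.show()
```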

BdEngineer
  • You can use a left join instead of inner, that will keep all rows from `df1` and automatically put a null in the columns with no matching row in `df2`. See here for different join types: https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark As for how to handle the partitioning in this case, is there any default value? If not, what to do would depend on the actual business requirement. – Shaido May 05 '20 at 07:53
  • Using left join the new column will have nulls automatically. Since you want to partition on the column, you could replace the nulls with a default value (or impute the values with forward-fill, etc) or you could keep them as null. Both should be ok. – Shaido May 05 '20 at 08:56
  • Forward fill would be something like this: https://stackoverflow.com/questions/33621319/spark-scala-forward-fill-with-last-observation , an alternative would be to replace with mean values: https://stackoverflow.com/questions/40057563/replace-missing-values-with-mean-spark-dataframe To get rows that differ between two dataframes you can use `except`/`subtract` or leftanti join: https://stackoverflow.com/questions/49270373/how-to-implement-except-in-apache-spark-based-on-subset-of-columns/49270468#49270468 (when done on a subset of the columns). – Shaido May 05 '20 at 09:14
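A minimal sketch of what the first two comments describe: a left join keeps every row of ds1 and leaves "new-column" null where there is no match, and the nulls can then either be kept or replaced with a default value before partitioning. The join key `id` and the default value `"unknown"` are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("left-join-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val ds1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val ds2 = Seq((1, "x"), (2, "y")).toDF("id", "new-column")

// Left join: all rows of ds1 survive; the unmatched row (id = 3)
// gets null in "new-column" automatically.
val joined = ds1.join(ds2, Seq("id"), "left")

// Optionally replace the nulls with a default value before partitioning,
// so every row lands in a concrete partition ("unknown" is an arbitrary choice).
val withDefault = joined.na.fill("unknown", Seq("new-column"))

// Partition on "new-column"; writer.partitionBy would be the equivalent when
// writing the result out.
val partitioned = withDefault.repartition(col("new-column"))
partitioned.show()
```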
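And a sketch of the alternatives mentioned in the last comment: forward-filling the nulls with the most recent non-null value via a window function, and finding the ds1 rows with no match in ds2 using a left_anti join. The ordering column `id` is an assumption, since the question gives no natural ordering:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val spark = SparkSession.builder()
  .appName("forward-fill-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val ds1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val ds2 = Seq((1, "x"), (2, "y")).toDF("id", "new-column")
val joined = ds1.join(ds2, Seq("id"), "left")

// Forward fill: carry the last non-null "new-column" value forward,
// ordering by "id" here only for illustration (an unpartitioned window
// pulls all data into a single partition, so a partitionBy is advisable on real data).
val fillWindow = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val forwardFilled = joined.withColumn(
  "new-column",
  last(col("new-column"), ignoreNulls = true).over(fillWindow)
)

// Rows of ds1 that have no counterpart in ds2 (the ones that would get null),
// found with a left_anti join on the assumed key.
val unmatched = ds1.join(ds2, Seq("id"), "left_anti")

forwardFilled.show()
unmatched.show()
```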

0 Answers