
Suppose you have a Dataset A and a Dataset B with the following records:

Dataset A:
    {key1, val1}
    {key2, val2}
    {key3, val3}

Dataset B:
    {key4, val4}
    {key1, valBB}
    {key5, valN}
    {key2, NNNNN}

After the "update" happens, this is what the final Dataset should look like:

Dataset Final:
    {key1, valBB}
    {key2, NNNNN}
    {key3, val3}
    {key4, val4}
    {key5, valN}

The approach I have taken so far is to convert the two Datasets to JavaRDDs, convert each JavaRDD to a JavaPairRDD, and then call firstPairRDD.subtractByKey(secondPairRDD). This gives me the records that exist in Dataset A but not in Dataset B. I then convert the result back to a Dataset. The next step is a union with Dataset B, which should give me the updated dataset. This isn't quite giving me the result I expected. Did I take the wrong approach? Any help would be appreciated.
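The subtractByKey-plus-union logic described above can be sketched with plain Java maps, using the sample records from the question. This is only a simulation of the intended semantics, not Spark code; the names `datasetA` and `datasetB` are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertSketch {
    public static void main(String[] args) {
        Map<String, String> datasetA = new LinkedHashMap<>();
        datasetA.put("key1", "val1");
        datasetA.put("key2", "val2");
        datasetA.put("key3", "val3");

        Map<String, String> datasetB = new LinkedHashMap<>();
        datasetB.put("key4", "val4");
        datasetB.put("key1", "valBB");
        datasetB.put("key5", "valN");
        datasetB.put("key2", "NNNNN");

        // "subtractByKey": keep only A's records whose key is absent from B
        Map<String, String> aMinusB = new LinkedHashMap<>(datasetA);
        aMinusB.keySet().removeAll(datasetB.keySet());

        // "union": combine the survivors with all of B
        Map<String, String> result = new LinkedHashMap<>(aMinusB);
        result.putAll(datasetB);

        // result now holds key1..key5 with B's values winning on conflicts
        System.out.println(result);
    }
}
```

If this simulation matches the "Dataset Final" above, the subtractByKey/union plan itself is sound, and any discrepancy is likely in the RDD conversions rather than the set logic.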

1 Answer

I ended up finding a more efficient solution:

    Dataset<Row> existsInAButNotB = A.join(B, A.col("key").equalTo(B.col("key")), "left_anti");
    Dataset<Row> Final = existsInAButNotB.union(B); 

If you have multiple columns which you use as your key, then your solution should look like this:

Dataset<Row> existsInAButNotB = A.join(B, A.col("key1").equalTo(B.col("key1")).and(A.col("key2").equalTo(B.col("key2"))), "left_anti");
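The multi-column case can also be sketched with plain Java maps by treating the two key columns as one composite key. Again this only simulates the left_anti-plus-union semantics; the record values here are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompositeKeyUpsert {
    public static void main(String[] args) {
        // Composite key encoded as "key1|key2"; all values are hypothetical
        Map<String, String> a = new LinkedHashMap<>();
        a.put("k1|x", "oldVal");
        a.put("k2|y", "keepMe");

        Map<String, String> b = new LinkedHashMap<>();
        b.put("k1|x", "newVal");
        b.put("k3|z", "added");

        // left_anti on the composite key: rows of A with no match in B
        Map<String, String> existsInAButNotB = new LinkedHashMap<>(a);
        existsInAButNotB.keySet().removeAll(b.keySet());

        // union with B produces the upserted result
        Map<String, String> result = new LinkedHashMap<>(existsInAButNotB);
        result.putAll(b);

        System.out.println(result);
    }
}
```

A row survives from A only when *both* key columns fail to match, which is exactly what chaining the equalTo conditions with and() expresses in the Spark join above.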

This one-liner keeps you out of the inefficient RDD world and avoids adding extra code.

Take a look at this:

Left Anti join in Spark?

More on Left Anti Join here:

what is the difference between an anti-join and an anti semi join?

Dataset Join API: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#join(org.apache.spark.sql.Dataset,%20org.apache.spark.sql.Column,%20java.lang.String)