
Suppose you have a Dataset A and a Dataset B with the following records:

Dataset A:
    {key1, val1}
    {key2, val2}
    {key3, val3}

Dataset B:
    {key4, val4}
    {key1, valBB}
    {key5, valN}
    {key2, NNNNN}

After the "update" happens, this is what the final Dataset should look like:

Dataset Final:
    {key1, valBB}
    {key2, NNNNN}
    {key3, val3}
    {key4, val4}
    {key5, valN}

The approach I have taken so far is to convert the two Datasets to JavaRDDs, convert each JavaRDD to a JavaPairRDD, and then call firstPairRDD.subtractByKey(secondPairRDD). This gives me the records that exist in Dataset A but not in Dataset B. I then convert the result back to a Dataset. The next step is a union with Dataset B, which should give me the updated dataset. This isn't quite giving me the result I expected. Did I take the wrong approach? Any help would be appreciated.
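The subtractByKey-plus-union logic described above can be sketched with plain Java maps, using the sample records from the question. This is only a simulation of the intended semantics, not Spark code; the names `datasetA` and `datasetB` are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertSketch {
    public static void main(String[] args) {
        Map<String, String> datasetA = new LinkedHashMap<>();
        datasetA.put("key1", "val1");
        datasetA.put("key2", "val2");
        datasetA.put("key3", "val3");

        Map<String, String> datasetB = new LinkedHashMap<>();
        datasetB.put("key4", "val4");
        datasetB.put("key1", "valBB");
        datasetB.put("key5", "valN");
        datasetB.put("key2", "NNNNN");

        // "subtractByKey": keep only A's records whose key is absent from B
        Map<String, String> aMinusB = new LinkedHashMap<>(datasetA);
        aMinusB.keySet().removeAll(datasetB.keySet());

        // "union": combine the survivors with all of B
        Map<String, String> result = new LinkedHashMap<>(aMinusB);
        result.putAll(datasetB);

        // result now holds key1..key5 with B's values winning on conflicts
        System.out.println(result);
    }
}
```

If this simulation matches the "Dataset Final" above, the subtractByKey/union plan itself is sound, and any discrepancy is likely in the RDD conversions rather than the set logic.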

1 Answer

I ended up finding a more efficient solution:

    Dataset<Row> existsInAButNotB = A.join(B, A.col("key").equalTo(B.col("key")), "left_anti");
    Dataset<Row> Final = existsInAButNotB.union(B); 

If you have multiple columns which you use as your key, then your solution should look like this:

Dataset<Row> existsInAButNotB = A.join(B, A.col("key1").equalTo(B.col("key1")).and(A.col("key2").equalTo(B.col("key2"))), "left_anti");
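The multi-column case can also be sketched with plain Java maps by treating the two key columns as one composite key. Again this only simulates the left_anti-plus-union semantics; the record values here are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompositeKeyUpsert {
    public static void main(String[] args) {
        // Composite key encoded as "key1|key2"; all values are hypothetical
        Map<String, String> a = new LinkedHashMap<>();
        a.put("k1|x", "oldVal");
        a.put("k2|y", "keepMe");

        Map<String, String> b = new LinkedHashMap<>();
        b.put("k1|x", "newVal");
        b.put("k3|z", "added");

        // left_anti on the composite key: rows of A with no match in B
        Map<String, String> existsInAButNotB = new LinkedHashMap<>(a);
        existsInAButNotB.keySet().removeAll(b.keySet());

        // union with B produces the upserted result
        Map<String, String> result = new LinkedHashMap<>(existsInAButNotB);
        result.putAll(b);

        System.out.println(result);
    }
}
```

A row survives from A only when *both* key columns fail to match, which is exactly what chaining the equalTo conditions with and() expresses in the Spark join above.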

This one-liner keeps you out of the inefficient RDD world and avoids adding extra code.

Take a look at this:

Left Anti join in Spark?

More on Left Anti Join here:

what is the difference between an anti-join and an anti semi join?

Dataset Join API: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#join(org.apache.spark.sql.Dataset,%20org.apache.spark.sql.Column,%20java.lang.String)