Suppose you have two Datasets, A and B, with the following records:
Dataset A:
{key1, val1}
{key2, val2}
{key3, val3}
Dataset B:
{key4, val4}
{key1, valBB}
{key5, valN}
{key2, NNNNN}
After the "update" (records in B replace records in A that share a key, and records with new keys are added), this is what the final Dataset should look like:
Dataset Final:
{key1, valBB}
{key2, NNNNN}
{key3, val3}
{key4, val4}
{key5, valN}
The approach I have taken so far is to convert the two Datasets to JavaRDDs, convert each JavaRDD to a JavaPairRDD, and then call firstPairRDD.subtractByKey(secondPairRDD). This gives me the records that exist in Dataset A but not in Dataset B. I then convert that result back to a Dataset and union it with Dataset B to get the updated Dataset. However, this isn't giving me the result I expected. Did I take the wrong approach? Any help would be appreciated.
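
To pin down the semantics I am after, here is a minimal plain-Java sketch of the two steps (subtract by key, then union). It is only a model of the intended logic, not Spark code: the class name UpsertSketch and the methods subtractByKey and upsert are hypothetical names for illustration, plain maps stand in for the Datasets, and it assumes each key appears at most once per dataset.

```java
import java.util.Map;
import java.util.TreeMap;

public class UpsertSketch {

    // Models the subtractByKey step: keeps the entries of a whose key does not occur in b.
    static Map<String, String> subtractByKey(Map<String, String> a, Map<String, String> b) {
        Map<String, String> result = new TreeMap<>(a);
        result.keySet().removeAll(b.keySet());
        return result;
    }

    // Models the union step: the surviving records of a plus every record of b.
    static Map<String, String> upsert(Map<String, String> a, Map<String, String> b) {
        Map<String, String> result = subtractByKey(a, b);
        result.putAll(b);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> a = new TreeMap<>();
        a.put("key1", "val1");
        a.put("key2", "val2");
        a.put("key3", "val3");

        Map<String, String> b = new TreeMap<>();
        b.put("key4", "val4");
        b.put("key1", "valBB");
        b.put("key5", "valN");
        b.put("key2", "NNNNN");

        // TreeMap keeps keys sorted, so this prints:
        // {key1=valBB, key2=NNNNN, key3=val3, key4=val4, key5=valN}
        System.out.println(upsert(a, b));
    }
}
```

The sorted map is used only to make the printed order match the listing above; rows in a distributed Dataset have no guaranteed order after a union unless they are sorted explicitly.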