I have a DataFrame which contains the following details.
|id|Name|Country|version|
|1 |Jack|UK     |new    |
|1 |Jack|USA    |old    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |
I would like to create a DataFrame where, if rows are duplicated based on "id", the new version is picked over the old version, like so:
|id|Name|Country|version|
|1 |Jack|UK     |new    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |
What is the best way to do this in Java/Spark, or do I have to use some sort of nested SQL query?
The simplified SQL version would look something like the following:
WITH new_version AS (
    SELECT
        ad.id
        ,ad.name
        ,ad.country
        ,ad.version
    FROM allData ad
    WHERE ad.version = 'new'
),
old_version AS (
    SELECT
        ad.id
        ,ad.name
        ,ad.country
        ,ad.version
    FROM allData ad
    LEFT JOIN new_version nv ON nv.id = ad.id
    WHERE ad.version = 'old'
    AND nv.id IS NULL
)
SELECT id, name, country, version FROM new_version
UNION ALL
SELECT id, name, country, version FROM old_version
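For comparison, here is a rough sketch of the Java/Spark direction I was considering instead of the nested SQL: ranking the rows within each id with a window function so that "new" sorts before "old", then keeping the first row per id. This assumes the input DataFrame is already loaded as allData; I'm not sure whether this is the idiomatic way to do it.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Order rows within each id so that 'new' comes before 'old'.
WindowSpec byId = Window.partitionBy("id")
        .orderBy(when(col("version").equalTo("new"), 0).otherwise(1));

// Keep only the first (preferred) row per id and drop the helper column.
Dataset<Row> deduped = allData
        .withColumn("rank", row_number().over(byId))
        .filter(col("rank").equalTo(1))
        .drop("rank");

The alternative I can think of is registering allData as a temporary view and running the SQL above through spark.sql(...), but that feels more roundabout, hence the question about the preferred approach.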