We receive daily files from an external system and store them in Hive, and we want to enable versioning on the data.
(col1, col2) is a composite key, so if we receive the same key combination again from a file, it should be stored in Hive with a new version. The latest data coming from the file should get the biggest version number. How could we do this in Spark?
file df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
| A| B| 777|2019-01-01 00:00:00| 1|
| K| D| 228|2019-01-01 00:00:00| 1|
| G| G| 241|2019-01-01 00:00:00| 1|
+----+----+-----+-------------------+-------+
We don't receive a version from the external system, but if one is needed for comparison it can always be assumed to be 1.
hive df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
| A| B| 999|2018-01-01 00:00:00| 1|
| A| B| 888|2018-01-02 00:00:00| 2|
| B| C| 133|2018-01-03 00:00:00| 1|
| G| G| 231|2018-01-01 00:00:00| 1|
+----+----+-----+-------------------+-------+
After the merge:
+----+----+-----+-------------------+-----------+
|col1|col2|value|                 ts|new_version|
+----+----+-----+-------------------+-----------+
| B| C| 133|2018-01-03 00:00:00| 1|
| K| D| 228|2019-01-01 00:00:00| 1|
| A| B| 999|2018-01-01 00:00:00| 1|
| A| B| 888|2018-01-02 00:00:00| 2|
| A| B| 777|2019-01-01 00:00:00| 3|
| G| G| 231|2018-01-01 00:00:00| 1|
| G| G| 241|2019-01-01 00:00:00| 2|
+----+----+-----+-------------------+-----------+
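A minimal sketch of one way to get this output in Spark (Scala), assuming the existing data and the incoming file are available as DataFrames. The idea is to union both sources (dropping the stored version), then recompute the version per (col1, col2) key with `row_number` over a window ordered by `ts`, so the newest row in each group gets the biggest number. The table name and file path below are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder()
  .appName("versioned-merge")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical sources; adjust the table name / file path to your setup.
val hiveDf = spark.table("mydb.versioned_data")
val fileDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/incoming/daily.csv")

// One version sequence per composite key, ordered by event time.
// Since the incoming file carries the newest ts, its rows get the
// biggest version numbers within each (col1, col2) group.
val keyWindow = Window.partitionBy("col1", "col2").orderBy("ts")

val merged = hiveDf
  .select("col1", "col2", "value", "ts")            // drop the stored version
  .union(fileDf.select("col1", "col2", "value", "ts"))
  .withColumn("new_version", row_number().over(keyWindow))
```

Two caveats with this approach: it recomputes every version from `ts` order rather than incrementing the stored one, which reproduces the existing numbers only as long as they were originally assigned in `ts` order (as in the sample data). Also, Spark will refuse to overwrite a Hive table it is reading from in the same job, so write `merged` to a staging table (or checkpoint it) before overwriting the target.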