I have multiple data files from different weeks, all in the same format. I need to consolidate them with Scala code running on Spark. The end result should contain only unique records per key, and when the same key appears in multiple files it should keep the record from the latest file.
Each data file can have close to half a billion records, so the code has to perform well.
Example:
Latest data file:
CID PID Metric
C1 P1 10
C2 P1 20
C2 P2 30
Previous data file:
CID PID Metric
C1 P1 20
C2 P1 30
C3 P1 40
C3 P2 50
Oldest data file:
CID PID Metric
C1 P1 30
C2 P1 40
C3 P1 50
C3 P2 60
C4 P1 30
Expected output:
C1 P1 10
C2 P1 20
C2 P2 30
C3 P1 40
C3 P2 50
C4 P1 30
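One possible sketch of what I mean (untested at this scale, and the file paths, CSV format, and column names are assumptions based on the example above): tag each file with a recency rank, union them, and keep only the newest row per (CID, PID) key using a window function.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object ConsolidateWeeklyFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consolidate").getOrCreate()

    // Hypothetical paths, newest first; in practice the order might be
    // derived from file names or modification times.
    val paths = Seq("latest.csv", "previous.csv", "oldest.csv")

    // Tag each file with a recency rank (0 = newest) so duplicates
    // across files can be resolved in favor of the newest file.
    val tagged = paths.zipWithIndex.map { case (path, rank) =>
      spark.read.option("header", "true").csv(path)
        .withColumn("file_rank", lit(rank))
    }.reduce(_ unionByName _)

    // Within each key, keep the single row with the smallest file_rank.
    val w = Window.partitionBy("CID", "PID").orderBy(col("file_rank"))
    val result = tagged
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn", "file_rank")

    result.write.option("header", "true").csv("consolidated")
  }
}
```

The window shuffles once on the key columns, which seems unavoidable for a global dedup; I'm unsure whether this is the fastest approach at ~500M rows per file, or whether something like `reduceByKey` on RDDs would do better.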