I'm using spark to load json files from Amazon S3. I would like to remove duplicates based on two columns of the data frame retaining the newest(I have timestamp column). What would be the best way to do it? Please note that the duplicates may be spread across partitions. Can I remove duplicates retaining the last record without shuffling? I'm dealing with 1 TB of data.
I was thinking of partitioning the data frame by those two columns in such way that all duplicate records will be "consistently hashed" into the same partition and thus a partition level sort followed be drop duplicates will eliminate all duplicates keeping just one. I dunno if it's possible. Any information is appreciated.