I would like to drop all records which are duplicate entries but have said a difference in the timestamp could be of any amount of time as an offset but for simplicity will use 2 minutes.
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:21|ABC |DEF |
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I would like my dataframe to have only rows
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I tried something like this but this does not remove duplicates.
val joinedDfNoDuplicates = joinedDFTransformed.as("df1").join(joinedDFTransformed.as("df2"), col("df1.ColA") === col("df2.ColA") &&
col("df1.ColB") === col("df2.ColB") &&
&& abs(unix_timestamp(col("Date")) - unix_timestamp(col("Date"))) > offset
)
For now, I am just selecting distinct or a group by min here Find minimum for a timestamp through Spark groupBy dataframe on the data based on certain columns but I would like a more robust solution the reason for this is that data outside of that interval may be valid data. Also, the offset could be changed so maybe within 5s or 5 minutes depending on requirements.
Somebody mentioned to me about creating a UDF comparing dates and if all other columns are the same but I am not sure exactly how to do that such that either I would filter out rows or add a flag and then remove those rows any help would be greatly appreciated.
Similiar sql question here Duplicate entries with different timestamp
Thanks!