I am working with a large dataset of event logs that looks something like this:
time | user_id | place | key | version |
---|---|---|---|---|
2023-02-13 06:28:54 | 30375 | School | 422i-dmank-ev2eia | 2.023 |
2023-02-13 06:24:42 | 47127 | School | wjes-wtpi-byt2rl0 | 2.023 |
2023-02-13 06:18:14 | 67491 | Work | 8kg7-too6-ihyqshh | 2.023 |
2023-02-13 06:03:10 | 36870 | Home | 9xbs-p5hy-envlb8h | 2.022 |
2023-02-13 05:58:24 | 14222 | School | 0z3k-ya93-fcleo2f | 2.022 |
Where:
- time is a POSIXct variable
- user_id and version are numeric variables
- place and key are character variables
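In case the exact structure matters, here is a rough two-row reconstruction with the column types listed above (dataframe is what the object is called in the snippets further down; only two of the sample rows are included):

```r
# Rough two-row reconstruction of the event log, using the column
# types described above (POSIXct, numeric, character)
dataframe <- data.frame(
  time    = as.POSIXct(c("2023-02-13 06:28:54", "2023-02-13 06:24:42")),
  user_id = c(30375, 47127),
  place   = c("School", "School"),
  key     = c("422i-dmank-ev2eia", "wjes-wtpi-byt2rl0"),
  version = c(2.023, 2.023),
  stringsAsFactors = FALSE
)

str(dataframe)
```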
There should be no identical rows in the dataset, so as part of the cleaning script I am using unique() to catch and delete duplicate rows. This removes some duplicates but not all, and it appears to be the same duplicate rows that are missed every time.
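The de-duplication step in the cleaning script is essentially just this one line (a simplified sketch, not the full script):

```r
# Keep only one copy of each row; exact duplicates should be dropped
dataframe <- unique(dataframe)
```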
What could be causing some duplicate rows not to be removed?
My investigation has been limited because I am very new to R and am not sure where to look for the problem, but I have tried the following:
- Switched unique() to distinct()
Result was the same, with the same duplicates being kept.
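For reference, the swap was roughly this (assuming the dplyr package, which is where distinct() comes from, is loaded):

```r
library(dplyr)

# Same intent as unique(): keep one copy of each distinct row
dataframe <- distinct(dataframe)
```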
- Checked that specific pairs of duplicate rows were actually identical
I guessed that there might be something 'invisible' making the kept rows actually unique, so I checked a handful of the non-removed duplicates with a manual-ish process:
row | time | user_id | place | key | version |
---|---|---|---|---|---|
x | 2023-02-13 06:28:54 | 30375 | School | 422i-dmank-ev2eia | 2.023 |
y | 2023-02-13 06:28:54 | 30375 | School | 422i-dmank-ev2eia | 2.023 |
```r
check <- dataframe[x,] == dataframe[y,]
```
Which always returned [TRUE, TRUE, TRUE, TRUE, TRUE]
So a difference in cell values does not appear to be why these rows are not being treated as duplicates.
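Put together, the manual check looked roughly like this. The indices x and y below are placeholders for one of the surviving duplicate pairs, not values from the real data:

```r
# Hypothetical row indices of one pair that unique() fails to collapse
x <- 101
y <- 102

# Column-by-column comparison of the two rows; for every pair I
# inspected, every column came back TRUE
check <- dataframe[x, ] == dataframe[y, ]

# Collapse to a single logical: TRUE means all printed values match
all(check)
```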