
I am working with a large dataset of event logs that looks something like this:

time                 user_id  place   key                version
2023-02-13 06:28:54  30375    School  422i-dmank-ev2eia  2.023
2023-02-13 06:24:42  47127    School  wjes-wtpi-byt2rl0  2.023
2023-02-13 06:18:14  67491    Work    8kg7-too6-ihyqshh  2.023
2023-02-13 06:03:10  36870    Home    9xbs-p5hy-envlb8h  2.022
2023-02-13 05:58:24  14222    School  0z3k-ya93-fcleo2f  2.022

Where:

  • time is a POSIXct var
  • user_id and version are num vars
  • place and key are chr vars

There should be no identical rows in the dataset, so as part of the cleaning script I am using unique() to catch and delete duplicate rows. This deletes some duplicates but not all, and it appears to be the same duplicate rows that are missed every time.

What could be causing some duplicate rows not to be removed?


My investigation has been limited because I am very new to R and am not sure where to look for the problem, but I have tried the following:

  • Switched unique() to distinct()

Result was the same, with the same duplicates being kept.

  • Checked that specific pairs of duplicate rows were actually identical

I guessed that there may be something 'invisible' that meant the kept rows were actually unique, so checked a handful of the non-removed duplicates with a manual-ish process:

  time                 user_id  place   key                version
x 2023-02-13 06:28:54  30375    School  422i-dmank-ev2eia  2.023
y 2023-02-13 06:28:54  30375    School  422i-dmank-ev2eia  2.023

check <- dataframe[x,] == dataframe[y,]

Which always returned [TRUE, TRUE, TRUE, TRUE, TRUE]

So it appears that a cell-level difference is not what's causing the rows not to be considered duplicates.
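For anyone checking something similar: a sketch on a made-up two-row data frame (the real data isn't shown here) of how `==`, `identical()`, and `duplicated()` each judge "sameness". `==` compares cell values, `identical()` also compares attributes such as row names, and `duplicated()` reports exactly which rows unique() would drop.

```r
# Toy stand-in for the real data frame (values copied from the example above)
df <- data.frame(
  time    = as.POSIXct(c("2023-02-13 06:28:54", "2023-02-13 06:28:54")),
  user_id = c(30375, 30375),
  place   = c("School", "School"),
  key     = c("422i-dmank-ev2eia", "422i-dmank-ev2eia"),
  version = c(2.023, 2.023)
)

df[1, ] == df[2, ]          # cell-by-cell comparison, as in the question: all TRUE
identical(df[1, ], df[2, ]) # often FALSE even for true duplicates, because
                            # the extracted rows carry different row names
duplicated(df)              # FALSE TRUE: the second row is what unique() drops
sum(duplicated(df))         # how many rows unique() should remove
```

So an all-TRUE `==` check plus a TRUE in `duplicated()` is good evidence the rows really are duplicates, and the problem lies elsewhere.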

gribbling
    Welcome to Stack Overflow! You need to provide a [minimal, reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data. We need your structure of the data to investigate this. You don't need to share all of your data, but the part that you are sharing formatted as table, should be shared using `dput(sample dataset name)`. Taking the [Tour](https://stackoverflow.com/tour) and reading [How to Ask](https://stackoverflow.com/help/how-to-ask) can be helpful as well. – M-- Feb 27 '23 at 07:35
  • Thanks for the feedback! I appreciate the advice - I had misunderstood what was meant by the minimal reproducible data and had missed the r-specific post. Conveniently, going to create the MRE also meant I found my problem - I hadn't properly saved the dataframe after calling for unique. Nothing wrong with unique(). – gribbling Feb 27 '23 at 07:58
  • One thing you can try is to use `unique()` on each column of the quarantined "seemingly identical" set and see if one column in particular is preventing the rows from being seen as identical. – Noah Feb 27 '23 at 08:01
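Noah's per-column suggestion can be sketched like this (a hypothetical example; `dups` stands in for the quarantined "seemingly identical" rows):

```r
# Toy stand-in: two rows that print almost identically but differ invisibly
dups <- data.frame(
  place   = c("School", "School "),  # trailing space in the second value
  user_id = c(30375, 30375)
)

# Count distinct values per column among the quarantined rows;
# any column reporting more than 1 is what keeps the rows apart
sapply(dups, function(col) length(unique(col)))
# here place reports 2 even though both values print as "School"
```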

1 Answer


Answering my own question - my issue turned out not to be with unique() itself, but with how I was using it inside a function.

The unique() call was made inside a function, and I had not assigned the de-duplicated dataframe it returned back over the original dataframe.

Preparing the MRE meant that I saved the dataset and ran unique() on it outside of this function, which worked and pointed to a different culprit. Thanks for your help!
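A minimal sketch of the mistake (function and variable names are made up for illustration): unique() does not modify its argument in place, so calling it inside a function does nothing unless its result is captured and assigned back by the caller.

```r
# The bug: unique()'s result is computed and discarded
clean_logs <- function(df) {
  unique(df)  # returns the de-duplicated frame...
  df          # ...but the original, duplicates intact, is what gets returned
}

# The fix: capture the result before returning it
clean_logs_fixed <- function(df) {
  df <- unique(df)
  df
}

logs <- data.frame(user_id = c(1, 1, 2), place = c("Home", "Home", "Work"))
nrow(clean_logs(logs))          # still 3
nrow(clean_logs_fixed(logs))    # 2
logs <- clean_logs_fixed(logs)  # remember to overwrite the original too
```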

gribbling