
I am trying to remove duplicated rows from my data frame, but neither distinct(d) nor filter(duplicated(d)) removes them (where d is the name of the data frame with duplicated rows) -- the functions simply do not recognize the rows as duplicates. Is there a common reason why this happens?

Below is an example dataset, exported with dput().

structure(list(id.case = c("114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746"), id.pair = c("78272-10794", "9330-10794", 
"9330-10794", "80739-42071", "80739-42071", "42114-10794", "42114-10794", 
"84701-42114", "84701-42114", "5533-42071", "5533-42071", "8876-5533", 
"8876-5533", "5652-42114", "5652-42114", "8920-5652", "8920-5652", 
"78272-5533", "78272-5533", "9114-78272"), e1.conditional.dyad = c(1.07224025692901, 
0.568380969299369, 0.568380969302098, 0.252545406662165, 0.252545406663273, 
-1.21808723071715, -1.21808723071797, -4.1477891182987, -4.14778911829956, 
-1.48315629665277, -1.48315629665359, -1.3047217588809, -1.30472175888309, 
-1.63547814316539, -1.63547814316453, -0.671008645771849, -0.671008645772957, 
-0.0801843233972761, -0.0801843233964519, 2.30874742062369)), row.names = c(NA, 
20L), class = "data.frame")

I am trying to use the code below.

d %>% distinct
J.K.
  • Can you give an example of rows which are duplicates in your data above? – Onyambu Jun 05 '22 at 14:45
  • @onyambu It turns out that there were no duplicates in the first place because of decimal points. Sorry for the confusion! – J.K. Jun 05 '22 at 16:59

3 Answers


Up front: your numbers are not exactly the same; see

d[2:3,]
#   id.case    id.pair e1.conditional.dyad
# 2  114746 9330-10794            0.568381
# 3  114746 9330-10794            0.568381
diff(d[2:3,3])
# [1] 2.729039e-12

Computers have limitations when it comes to floating-point numbers (a.k.a. double, numeric, float). This is a fundamental limitation of how computers in general deal with non-integer numbers, not something specific to any one programming language. There are add-on libraries and packages that are much better at arbitrary-precision math, but I believe most mainstream languages (this is relative/subjective, I admit) do not use them by default. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754
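
For illustration, exact comparison flags rows 2 and 3 as different, while a tolerance-based check such as all.equal (default tolerance about 1.5e-8) treats them as equal; the classic 0.1 + 0.2 example shows the same effect:

d$e1.conditional.dyad[2] == d$e1.conditional.dyad[3]           # exact comparison
# [1] FALSE
all.equal(d$e1.conditional.dyad[2], d$e1.conditional.dyad[3])  # within tolerance
# [1] TRUE
0.1 + 0.2 == 0.3
# [1] FALSE
print(0.1 + 0.2, digits = 17)
# [1] 0.30000000000000004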

To continue using distinct without losing the actual precision of your values, try

d %>%
  # deduplicate on a rounded helper column; .keep_all = TRUE retains the original, unrounded values
  distinct(id.case, id.pair, ign = round(e1.conditional.dyad, 8), .keep_all = TRUE) %>%
  select(-ign)
#    id.case     id.pair e1.conditional.dyad
# 1   114746 78272-10794          1.07224026
# 2   114746  9330-10794          0.56838097
# 3   114746 80739-42071          0.25254541
# 4   114746 42114-10794         -1.21808723
# 5   114746 84701-42114         -4.14778912
# 6   114746  5533-42071         -1.48315630
# 7   114746   8876-5533         -1.30472176
# 8   114746  5652-42114         -1.63547814
# 9   114746   8920-5652         -0.67100865
# 10  114746  78272-5533         -0.08018432
# 11  114746  9114-78272          2.30874742

where the decision to use 8 digits is arbitrary (here) and sensitive to your knowledge of the data.
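
If you are unsure how many digits are safe, one possible sanity check (a small sketch, assuming dplyr is loaded) is to look at how far apart the values actually are within each (id.case, id.pair) group; in this data the largest within-group spread is on the order of 1e-12, far below the 1e-8 level at which 8-digit rounding starts to matter:

library(dplyr)
d %>%
  group_by(id.case, id.pair) %>%
  summarize(spread = diff(range(e1.conditional.dyad)), .groups = "drop") %>%
  arrange(desc(spread))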

r2evans
  • I really appreciate your explanation and the solution!!!! It is perfect, even the last line of your comment (sensitive to your knowledge of the data). Great learning experience. @r2evans – J.K. Jun 05 '22 at 17:00

The problem is that your numeric column doesn't actually contain duplicates: the values differ in their trailing decimal digits. If you round that column, you can then remove the duplicates like this:

# round in place (note: this permanently replaces the stored values with the rounded ones)
d$e1.conditional.dyad <- round(d$e1.conditional.dyad, digits = 4)
d %>% distinct()

Output:

   id.case     id.pair e1.conditional.dyad
1   114746 78272-10794              1.0722
2   114746  9330-10794              0.5684
3   114746 80739-42071              0.2525
4   114746 42114-10794             -1.2181
5   114746 84701-42114             -4.1478
6   114746  5533-42071             -1.4832
7   114746   8876-5533             -1.3047
8   114746  5652-42114             -1.6355
9   114746   8920-5652             -0.6710
10  114746  78272-5533             -0.0802
11  114746  9114-78272              2.3087
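
If you would rather keep the stored values at full precision, a base-R variant of the same idea (a sketch, starting again from the original, unrounded d) rounds only inside the duplicate check:

# flag duplicates on a rounded key, then subset the untouched data frame
key <- data.frame(d$id.case, d$id.pair, round(d$e1.conditional.dyad, 4))
d[!duplicated(key), ]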
Quinten
  • Similar to the previous solution, it worked well! Thank you for suggesting that I should provide the sample data (in a previous comment, now removed). It helped me find out that there were no duplicates in the first place. @Quinten – J.K. Jun 05 '22 at 17:02

Here's one approach (though I'm sure there are better ones). The trick is to first collapse each row of the data frame into a single diagnostic helper column and then apply the duplicated function to that column:

d %>%
  mutate(diagnost = apply(d, 1, paste0, collapse = "")) %>%
  filter(!duplicated(diagnost)) %>%
  select(-diagnost)
   id.case     id.pair e1.conditional.dyad
1   114746 78272-10794          1.07224026
2   114746  9330-10794          0.56838097
3   114746 80739-42071          0.25254541
4   114746 42114-10794         -1.21808723
5   114746 84701-42114         -4.14778912
6   114746  5533-42071         -1.48315630
7   114746   8876-5533         -1.30472176
8   114746  5652-42114         -1.63547814
9   114746   8920-5652         -0.67100865
10  114746  78272-5533         -0.08018432
11  114746  9114-78272          2.30874742
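
For what it's worth, this appears to work because apply() coerces the data frame to a character matrix, and that coercion formats the numbers with getOption("digits") significant digits (7 by default), so the near-equal values collapse to identical strings; with more digits they no longer match, which is where the fragility comes from. A small illustration using the rows from the question:

format(d$e1.conditional.dyad[2:3])               # default, about 7 significant digits
# [1] "0.568381" "0.568381"
format(d$e1.conditional.dyad[2:3], digits = 15)  # more digits, no longer identical
# [1] "0.568380969299369" "0.568380969302098"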
Chris Ruehlemann
  • Try this again after `options(digits=22)` ... unfortunately, it is prone to problems. – r2evans Jun 05 '22 at 15:09
  • I will try this when I have time and leave a comment if it works. Thank you for taking your time on this solution! @Chris Ruehlemann – J.K. Jun 05 '22 at 17:03