R distinct() does not take out duplicates

Question

I have been battling with this for a while now. As part of a large for-loop, I want to remove some data points so that I can build a concave hull around the remaining points (which needs a minimum of 4 points). For this I have a line that removes clusters in which ALL x values or ALL y values are the same, as well as clusters with fewer than 4 rows. However, some (not all) points within a cluster can be duplicates, so the cluster has >= 4 rows but fewer than 4 distinct points. To remove these duplicates I use distinct(), but sometimes it fails to remove them, as with the example data frame below. Any idea how to remove these duplicates effectively?

Example data

SP_occ <- structure(list(x = c(-28.212197, -130.758, -15, 47.549999, -29.346937, 
-27.794644, -124.8, 47.416698, 47.75, -15.566667, 178.73, -29.344852, 
175.432999, 47.75, 87, -10, 55.666668, 46.533, 47, 114.75, -29.356563, 
87, 46, -128.296, -9, 154.21667, 47.549999, 47.549999, 87, -72.133301, 
-157.89167, -23.055, 87, 46.366665, 55.45, 122.932999, -28.991, 
153.216995, -29.35066, -29.122, 47.75, 123.967003, 121.5, 27.4167, 
-27.96666, 47.266701, 87, 87, 47.583302, 114.75, -26.610647, 
-26.589459, -10, 87, 122.949997, 47.583302, 125.400002, -15.533334, 
-25.239904, 45.533, -28.295, 47.416698, 46, 52.0833, 87, 172.932999, 
47.75, 5.4629, 121.667, 27.4167, -29.344852, -29.346937, -29.356563, 
-9.387, -28.212197, -27.794644, 154.216667, -28.991, -28.991, 
-29.35066, -25.239904, -26.610647, -26.589459, -27.96666, -15, 
87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 52.0833, 45.533, 
46.533, 114.75, -10, -15.533333, -15.566667, 178.73, -9.5, -9.466667, 
-9.466667, -9.466667, -9.466667, -9.466667, -9.466667, -8.916667, 
-8.916667, -9.083333, 152.756836, 138.74492, -9.321667, 5.4629, 
139.416667, 55.666668), y = c(38.659904, -23.931, 55, -38.366699, 
38.681605, 39.000465, -24.68, -38.349998, -38.650002, 28.183332, 
-38.65, 38.68313, -28.1833, -38.650002, -27, 46, -4.582778, -39.033, 
-9, -35, 38.671144, -27, -12, -24.328, 56, -20.85, -38.366699, 
-38.9333, -27, 40.966702, 21.391684, 16.5667, -27, -9.416667, 
-4.766666, 24.5, 42.497, -20.85, 37.997214, 42.432, -38.583302, 
24.0667, -11, -33.3167, 38.962846, -38.950001, -27, -27, -38.966702, 
-35, 40.341647, 40.357008, 46, -27, 24.299999, -38.966702, 24.5833, 
28.266666, 37.900563, -40.416, 29.891666, -38.349998, -9, -36.5833, 
-27, -28.5667, -38.583302, -26.1297, -11, -33.3167, 38.68313, 
38.681605, 38.671144, 57.245, 38.659904, 39.000465, -20.85, 42.497, 
42.497, 37.997214, 37.900563, 40.341647, 40.357008, 38.962846, 
55, -27, -27, -27, -27, -27, -27, -27, -27, -27, -27, -27, -36.5833, 
-40.416, -39.033, -35, 46, 28.266667, 28.183333, -38.65, 55.733333, 
55.666667, 55.666667, 55.666667, 55.666667, 55.666667, 55.666667, 
58.583333, 58.583333, 56.691667, -33.054223, 34.908889, 38.285, 
-26.1297, 35.25, -4.582778), cluster = c(1L, 2L, 3L, 4L, 5L, 
1L, 6L, 4L, 4L, 7L, 8L, 5L, 9L, 4L, 10L, 11L, 12L, 13L, 14L, 
15L, 5L, 10L, 16L, 17L, 18L, 19L, 4L, 4L, 10L, 20L, 21L, 22L, 
10L, 23L, 12L, 24L, 25L, 26L, 27L, 25L, 4L, 28L, 29L, 30L, 1L, 
4L, 10L, 10L, 4L, 15L, 31L, 31L, 11L, 10L, 24L, 4L, 32L, 7L, 
33L, 34L, 35L, 4L, 36L, 37L, 10L, 38L, 4L, 39L, 29L, 30L, 5L, 
5L, 5L, 40L, 1L, 1L, 19L, 25L, 25L, 27L, 33L, 31L, 31L, 1L, 3L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 37L, 34L, 
13L, 15L, 11L, 7L, 7L, 8L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 
42L, 42L, 43L, 44L, 45L, 46L, 39L, 47L, 12L)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 
68L, 69L, 70L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 
84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 
97L, 98L, 99L, 100L, 101L, 103L, 105L, 106L, 107L, 108L, 109L, 
111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 123L, 
125L, 126L, 135L, 136L, 141L), class = "data.frame")

Code

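library(dplyr)
# Drop exact duplicate rows
SP_occ  <- SP_occ %>% distinct()
# Keep only clusters with >= 4 rows whose x and y values are not all identical
SP_occ  <- SP_occ %>% group_by(cluster) %>% filter(!(n_distinct(round(x, 6)) == 1 || n_distinct(round(y, 6)) == 1) && n() >= 4)
SP_occ  <- SP_occ[SP_occ$cluster != 0,]
# Renumber the remaining clusters and keep x, y and the new Cluster column
SP_occ$Cluster <- SP_occ %>% group_indices(cluster)
SP_occ         <- SP_occ[, c(1,2,4)]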

Generate an id vector, for example something along the lines of dataset$id <- paste0(x, cluster, group). Then count how many times each id is repeated, e.g. dataset <- dataset %>% group_by(id) %>% mutate(duplicates = n()). Finally, inspect and delete the observations with n higher than 2. – Adrian del rio rodriguez, May 01 '20 at 14:22

score 0 · Answer 1 · answered May 01 '20 at 14:27

Could you explain which records in your example are the problem you are referring to? After using distinct() there are no remaining exact duplicates in your data. If you want to remove records that are 'almost' identical (very small numerical differences), you could round the coordinates before calling distinct():

SP_occ  <- SP_occ %>% 
  mutate(x = round(x,5),
         y = round(y,5)) %>% 
  distinct()
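
To illustrate the difference, here is a minimal sketch using only the four cluster-7 rows copied from the example data at full precision. The names cl7, n_exact and n_rounded are made up for this illustration.

```r
library(dplyr)

# The four cluster-7 rows from the question's dput() output, at full precision
cl7 <- data.frame(
  x = c(-15.566667, -15.533334, -15.533333, -15.566667),
  y = c(28.183332, 28.266666, 28.266667, 28.183333),
  cluster = 7L
)

# Exact comparison: the rows differ in the sixth decimal, so none is removed
n_exact <- nrow(distinct(cl7))

# Rounding to 5 decimals first merges the near-duplicates
n_rounded <- cl7 %>%
  mutate(x = round(x, 5), y = round(y, 5)) %>%
  distinct() %>%
  nrow()

n_exact    # 4
n_rounded  # 2
```

Note that this overwrites x and y with the rounded values. Whether 5 decimals is the right tolerance depends on the data.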

pieterbons

Thanks for the tip. See my additional answer for the output that I get from the code. – Shark167 May 01 '20 at 14:35
As you can see in your own input (search this page for '-15.53333'), the values are actually not exactly the same: one is -15.533333 and the other is -15.533334. – pieterbons May 01 '20 at 14:44
Yes, I see that, but what causes it to round to 5 decimals? – Shark167 May 01 '20 at 14:48
This is just the way it is presented by R. The value itself is not actually rounded, but floating-point numbers are not printed with all their decimals because it would clutter the screen. – pieterbons May 01 '20 at 14:50
Try setting options(digits=10) – pieterbons May 01 '20 at 14:51
In any case, the records are not identical (you can verify this yourself by exporting to a csv and opening it with a text editor). If you want to control the precision with which numbers are printed in R, you could check out this question: https://stackoverflow.com/questions/2287616/controlling-number-of-decimal-digits-in-print-output-in-r. – pieterbons May 01 '20 at 15:03
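
The printing behaviour described in these comments can be checked directly. This is a small sketch using the two x values quoted above. The variable names are illustrative, and it assumes R's default of 7 significant digits.

```r
# The two x values from the comments above
x <- c(-15.533333, -15.533334)

# The stored values really are different
same <- x[1] == x[2]

# With 7 significant digits (R's default) both are displayed identically
shown_default <- format(x, digits = 7)

# Asking for six decimals reveals the difference
shown_full <- sprintf("%.6f", x)

same           # FALSE
shown_default  # "-15.53333" "-15.53333"
shown_full     # "-15.533333" "-15.533334"

# The same effect for a single print call, without changing global options
print(x, digits = 10)
```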

score 0 · Answer 2 · answered May 01 '20 at 14:34

The result that I get is the data frame below. Cluster 2 is made up of 4 rows, of which only 2 are actually unique.

    x           y           Cluster
1   47.55000    -38.36670   1
2   47.41670    -38.35000   1
3   47.75000    -38.65000   1
4   -15.56667   28.18333    2
5   47.55000    -38.93330   1
6   47.75000    -38.58330   1
7   47.26670    -38.95000   1
8   47.58330    -38.96670   1
9   -15.53333   28.26667    2
10  -15.53333   28.26667    2
11  -15.56667   28.18333    2
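
One possible way to drop such clusters while keeping the original coordinates is to deduplicate on rounded helper columns and only then apply the size filter. This is a sketch rather than a tested solution for the full loop. It uses a small subset of the example data (four rows of original cluster 4 and the four rows of original cluster 7), and the names pts, x_r, y_r and kept, as well as the 5-decimal tolerance, are assumptions made for the illustration.

```r
library(dplyr)

# Subset of the example data: cluster 4 has four distinct points,
# cluster 7 has four rows but only two distinct points after rounding
pts <- data.frame(
  x = c(47.549999, 47.416698, 47.75, 47.549999,
        -15.566667, -15.533334, -15.533333, -15.566667),
  y = c(-38.366699, -38.349998, -38.650002, -38.9333,
        28.183332, 28.266666, 28.266667, 28.183333),
  cluster = c(4L, 4L, 4L, 4L, 7L, 7L, 7L, 7L)
)

kept <- pts %>%
  # Rounded helper columns are used only for comparison
  mutate(x_r = round(x, 5), y_r = round(y, 5)) %>%
  # Keep the first row of each near-duplicate point within a cluster
  distinct(cluster, x_r, y_r, .keep_all = TRUE) %>%
  group_by(cluster) %>%
  # At least 4 distinct points, and x and y must not be constant
  filter(n() >= 4, n_distinct(x_r) > 1, n_distinct(y_r) > 1) %>%
  ungroup() %>%
  select(x, y, cluster)

kept  # only the four rows of cluster 4 remain, with unrounded coordinates
```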
