Find duplicate values and have references

Question

My data is `data.lab`:

data.lab <- data.frame(Name  = c("A", "e", "b", "c", "d"),
                       bp    = c(12, 12, 11, 12, 11),
                       sugar = c(19, 21, 23, 19, 23))

I want to keep only the rows whose `bp` and `sugar` values are duplicated, together with a reference column.

Desired output:

lab.data <- data.frame(Name  = c("A", "b", "c", "d"),
                       bp    = c(12, 11, 12, 11),
                       sugar = c(19, 23, 19, 23),
                       pair  = c(1, 1, 2, 2))




What I have tried:

dub.data <- duplicated(data.lab) | duplicated(data.lab, fromLast = TRUE)
out.1 <- data.lab[dub.data, ]

This gives the duplicated rows, but I also need a column showing which rows form the duplicate pairs.

Can you please provide your example in a reproducible manner? If you need advice on how to do that, see [this question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, you are probably looking for `which`. — Roman Luštrik, Apr 18 '19 at 05:35

tmfmnk · Accepted Answer · 2019-04-23T10:35:12.260

2

With dplyr, you can do:

library(dplyr)

data.lab %>%
  group_by(bp, sugar) %>%
  filter(n() == 2) %>%
  mutate(pair = seq_along(Name))

  Name     bp sugar  pair
  <fct> <dbl> <dbl> <int>
1 A        12    19     1
2 b        11    23     1
3 c        12    19     2
4 d        11    23     2

Or:

data.lab %>%
 group_by(bp, sugar) %>%
 filter(n() == 2) %>%
 mutate(pair = row_number())

Or, if a group could contain more than two duplicates:

data.lab %>%
 group_by(bp, sugar) %>%
 filter(n() > 1) %>%
 mutate(pair = seq_along(Name))

Or:

data.lab %>%
 group_by(bp, sugar) %>%
 filter(n() > 1) %>%
 mutate(pair = row_number())

Or, to group by all variables except "Name":

data.lab %>%
 group_by_at(vars(-matches("(Name)"))) %>%
 filter(n() > 1) %>%
 mutate(pair = seq_along(Name))

Or:

data.lab %>%
 group_by_at(vars(-matches("(Name)"))) %>%
 filter(n() > 1) %>%
 mutate(pair = row_number())
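
For readers on dplyr 1.0.0 or later, where `group_by_at()` is superseded, the same idea can probably be written with `across()`. This is only a sketch under that version assumption, reusing the `data.lab` from the question; the name `out` and the trailing `ungroup()` are additions for illustration and are not part of the original answer.

```r
library(dplyr)

# Data from the question
data.lab <- data.frame(Name  = c("A", "e", "b", "c", "d"),
                       bp    = c(12, 12, 11, 12, 11),
                       sugar = c(19, 21, 23, 19, 23))

# Group by every column except Name, keep groups with more than one row,
# and number the rows within each group
out <- data.lab %>%
  group_by(across(-Name)) %>%
  filter(n() > 1) %>%
  mutate(pair = row_number()) %>%
  ungroup()

out
```

Dropping the grouping at the end is optional; it just avoids carrying the groups into later steps.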

edited Apr 23 '19 at 10:35

answered Apr 18 '19 at 06:19

tmfmnk


2

@Ronak Shah I usually consider pairs to have exactly two cases, but yes, you are right, the OP may also want more than two duplicates per group. Updated the post, thanks :) – tmfmnk Apr 18 '19 at 07:30
Is there any way I can use all column names instead of listing bp, sugar explicitly? – Allabux Jaffer Apr 23 '19 at 06:52
Yes, there is a way to group by all of the variables except "Name". Added it to the post. – tmfmnk Apr 23 '19 at 10:34

Ronak Shah · Answer 2 · 2019-04-23T07:03:39.613

1

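Continuing from your approach, we can use `ave` in base R:

# Keep rows whose (bp, sugar) combination occurs more than once
dat1 <- data.lab[duplicated(data.lab[c("bp", "sugar")]) |
                 duplicated(data.lab[c("bp", "sugar")], fromLast = TRUE), ]

# Number the rows within each (bp, sugar) group
# (assumes Name is character, not factor; see the comments)
dat1$pair <- with(dat1, ave(Name, bp, sugar, FUN = seq_along))
dat1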

#  Name bp sugar pair
#1    A 12    19    1
#3    b 11    23    1
#4    c 12    19    2
#5    d 11    23    2
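
If `Name` is a factor (the default for `data.frame()` before R 4.0.0), one way to sidestep the conversion is to pass row positions to `ave` instead of `Name`. The following is a minimal sketch of that variant, not the code as originally posted; `keys` and `is_dup` are names introduced here for illustration, and `stringsAsFactors = TRUE` is set only to reproduce the factor case.

```r
# Data from the question, with Name deliberately stored as a factor
data.lab <- data.frame(Name  = c("A", "e", "b", "c", "d"),
                       bp    = c(12, 12, 11, 12, 11),
                       sugar = c(19, 21, 23, 19, 23),
                       stringsAsFactors = TRUE)

# Flag rows whose (bp, sugar) combination occurs more than once
keys   <- data.lab[c("bp", "sugar")]
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Subset first, then number the rows within each (bp, sugar) group.
# ave() works on integer row positions, so the type of Name does not matter
# and pair comes out as an integer column.
dat1 <- data.lab[is_dup, ]
dat1$pair <- ave(seq_len(nrow(dat1)), dat1$bp, dat1$sugar, FUN = seq_along)

dat1
```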

edited Apr 23 '19 at 07:03

answered Apr 18 '19 at 06:32

Ronak Shah


It's giving the error below: Error in `$<-.data.frame`(`*tmp*`, "pair", value = c(NA_integer_, NA_integer_, : replacement has 4 rows, data has 5 In addition: Warning messages: 1: In `[<-.factor`(`*tmp*`, i, value = 1:2) : invalid factor level, NA generated 2: In `[<-.factor`(`*tmp*`, i, value = 1:2) : invalid factor level, NA generated – Allabux Jaffer Apr 23 '19 at 06:54
@AllabuxJaffer That is because your `Name` column is a factor. Can you change it to character by doing `data.lab$Name <- as.character(data.lab$Name)` and then run the code? – Ronak Shah Apr 23 '19 at 06:59
Now only this error appears: Error in `$<-.data.frame`(`*tmp*`, "pair", value = c("1", "1", "2", "2" : replacement has 4 rows, data has 5 – Allabux Jaffer Apr 23 '19 at 07:01
@AllabuxJaffer sorry, we need to subset first. I have updated the answer. Can you check now? – Ronak Shah Apr 23 '19 at 07:04
