Assign NA to duplicate values within dataframe prior to using reshape

Question

I want to assign NA to a duplicate value prior to using reshape in order to avoid duplicates in my wide dataset after reshaping. In the example data frame below, I would like to assign NA to all duplicate values in X1 and X2, but not X3, for each ID in my dataset. This means for ID=3, NA should be assigned to X2 in row 4, and for ID=4 this applies to X1 for row 6 and 8, and to x2 also for row 6 and 8. Values of X3 should remain. I want to assign NA since all rows should remain in the data frame.

df <- read.table(header=TRUE,text = 
"ID X1  X2  X3
1   A   X   23
2   B   Y   4
3   A   X   32
3   B   X   6
4   A   Y   45
4   B   Y   7
4   A   Z   5
4   B   Z   3
         ")

  ID X1 X2 X3
1  1  A  X 23
2  2  B  Y  4
3  3  A  X 32
4  3  B  X  6
5  4  A  Y 45
6  4  B  Y  7
7  4  A  Z  5
8  4  B  Z  3

Gregor Thomas · Accepted Answer · 2016-07-29T16:15:33.907

4

duplicated() is useful for identifying duplicates.

df[duplicated(df[c("ID", "X1")]), "X1"] = NA
df[duplicated(df[c("ID", "X2")]), "X2"] = NA

df
#   ID   X1   X2 X3
# 1  1    A    X 23
# 2  2    B    Y  4
# 3  3    A    X 32
# 4  3    B <NA>  6
# 5  4    A    Y 45
# 6  4    B <NA>  7
# 7  4 <NA>    Z  5
# 8  4 <NA> <NA>  3

edited Jul 29 '16 at 16:15

answered Jul 29 '16 at 07:36

Gregor Thomas

136,190
20
167
294

1

Clever to include the ID in the `duplicated` call – talat Jul 29 '16 at 07:38
Thanks! I was starting to write an `ifelse` answer, but felt like there should be a way to avoid a "group by" solution. And then I realized I could just do direct replacement. – Gregor Thomas Jul 29 '16 at 07:39
Thanks for the answer! Got a minor error regarding brackets. After changing syntax to `df[duplicated(df[c("ID", "X1")]), "X1"] = NA` it works perfectly – Joep_S Jul 29 '16 at 07:54
Using the `is.na<-` method, `is.na(df$X1) <- duplicated(df[c("ID", "X1")])`. – lmo Jul 29 '16 at 12:03

score 1 · Answer 2 · answered Jul 29 '16 at 08:50

We can use dplyr

library(dplyr)
df %>% 
  group_by(ID) %>% 
  mutate_each(funs(replace(., duplicated(.), NA)), X1:X2)
#      ID     X1     X2    X3
#   <int> <fctr> <fctr> <int>
#1     1      A      X    23
#2     2      B      Y     4
#3     3      A      X    32
#4     3      B     NA     6
#5     4      A      Y    45
#6     4      B     NA     7
#7     4     NA      Z     5
#8     4     NA     NA     3

score 0 · Answer 3 · edited Jul 29 '16 at 07:50

0

You could try:

library(data.table)
setDT(df)
df[, c("X1","X2") := .(ifelse(duplicated(X1), NA, X1), ifelse(duplicated(X2), NA, X2)), by = ID]

Result:

    ID X1 X2 X3
1:  1  A  X 23
2:  2  B  Y  4
3:  3  A  X 32
4:  3  B NA  6
5:  4  A  Y 45
6:  4  B NA  7
7:  4 NA  Z  5
8:  4 NA NA  3

edited Jul 29 '16 at 07:50

Jaap

81,064
34
182
193

answered Jul 29 '16 at 07:36

Assign NA to duplicate values within dataframe prior to using reshape

3 Answers3