0

I have the following data.frame.

a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)


  a  b c d
1 A NA 1 1
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
6 B NA 1 6
7 C  2 2 7
8 C NA 2 8
9 D NA 2 9

I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.

Jay
  • 741
  • 10
  • 26

3 Answers3

1

One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements

df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
#  a  b c d
#2 A  1 1 2
#3 A  2 2 3
#5 B  1 1 5
#4 B  4 4 4
#7 C  2 2 7
#9 D NA 2 9

setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This works. What's the purpose of `is.na(df$b)` in the order function? I tried it after removing it and df2 is identical. – Jay Mar 21 '17 at 06:12
  • @Stephen I just updated the post with the description. It is to make sure the NA elements are last for each 'a', 'b' pair – akrun Mar 21 '17 at 06:14
1

First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :

x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]

Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.

rbind(na.omit(x),x1)

Answer:

   a  b c d

2  A  1 1 2

3  A  2 2 3

4  B  4 4 4

5  B  1 1 5

7  C  2 2 7

9  D NA 2 9
Sotos
  • 51,121
  • 6
  • 32
  • 66
Ranjan Raj
  • 11
  • 3
0

You can use dplyr to do this.

df %>% distinct(a, c,  .keep_all = TRUE)                 

Output

  a  b c d
1 A NA 1 1
2 A  2 2 3
3 B  4 4 4
4 B  1 1 5
5 C  2 2 7
6 D NA 2 9

There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Community
  • 1
  • 1
discipulus
  • 2,665
  • 3
  • 34
  • 51