Remove semi duplicate rows in R

Question

I have the following data.frame.

a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)


  a  b c d
1 A NA 1 1
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
6 B NA 1 6
7 C  2 2 7
8 C NA 2 8
9 D NA 2 9

I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.

Not quite. That removes rows 5 & 6 and removes row 2 instead of 1. — Jay, Mar 21 '17 at 06:08

akrun · Answer 1 · 2017-03-21T06:13:29.980

1

One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements

df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
#  a  b c d
#2 A  1 1 2
#3 A  2 2 3
#5 B  1 1 5
#4 B  4 4 4
#7 C  2 2 7
#9 D NA 2 9

setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8

edited Mar 21 '17 at 06:13

answered Mar 21 '17 at 06:08

akrun

874,273
37
540
662

This works. What's the purpose of `is.na(df$b)` in the order function? I tried it after removing it and df2 is identical. – Jay Mar 21 '17 at 06:12
@Stephen I just updated the post with the description. It is to make sure the NA elements are last for each 'a', 'b' pair – akrun Mar 21 '17 at 06:14

score 1 · Answer 2 · edited Mar 21 '17 at 07:23

First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :

x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]

Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.

rbind(na.omit(x),x1)

Answer:

   a  b c d

2  A  1 1 2

3  A  2 2 3

4  B  4 4 4

5  B  1 1 5

7  C  2 2 7

9  D NA 2 9

score 0 · Answer 3 · edited May 23 '17 at 12:09

0

You can use dplyr to do this.

df %>% distinct(a, c,  .keep_all = TRUE)

Output

There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

edited May 23 '17 at 12:09

Community

1
1

answered Mar 21 '17 at 06:23

discipulus

2,665
3
34
51

Remove semi duplicate rows in R

3 Answers3

Output