1

I am trying to find data where three out of four columns are duplicated, and then to remove duplicates but keep the row with the largest number for the otherwise identical data.

I found this very helpful article on the StackOverflow which I think gets me about half way there.

I will base my question of the example in that question. (The example has more columns than what I am working on but I don' think that really matters.)

require(tidyverse)
x = iris%>%select(-Petal.Width)
dups = x[x%>%duplicated(),]
answer =  iris%>%semi_join(dups)

> answer 
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           5.1         3.5          1.4         0.2    setosa
2           4.9         3.1          1.5         0.1    setosa
3           4.8         3.0          1.4         0.1    setosa
4           5.1         3.5          1.4         0.3    setosa
5           4.9         3.1          1.5         0.2    setosa
6           4.8         3.0          1.4         0.3    setosa
7           5.8         2.7          5.1         1.9 virginica
8           6.7         3.3          5.7         2.1 virginica
9           6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica

That article introduced me to code that will identify all rows where everything is equal except petal width:

iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]

This is great but I don't know how to progress from here. I would like to have rows 2 and 5 to collapse into a single row that is equal to row 5. Similarly 9 & 10, should become just 10, and 8 & 12 become just 12.

The data set I have has more than 2 rows in some sets of duplicates, so I haven't had any luck using arrange functions to order them and delete the smallest row.

Novo88
  • 109
  • 1
  • 7
  • 1
    wouldn't this be `answer %>% group_by_at(-4) %>% slice(which.max(Petal.Width))` ? Also there are other matches as well. For example, row 1 and 4. – Ronak Shah Aug 02 '19 at 07:22

1 Answers1

1

This should do what you want

iris %>%
   group_by(Sepal.Length,
            Sepal.Width,
            Petal.Length,
            Species) %>% 
   filter(Petal.Width == max(Petal.Width)) %>% 
   filter(row_number() == 1) %>% 
   ungroup()

The second filtering is to get rid of duplicates if the Petal.Width is also identical for two entries. Does this work for you?

jludewig
  • 428
  • 2
  • 8