Remove rows conditionally in a dataframe

Question

I have got a dataframe and I would like to remove some duplicate rows by taking the ones with max values.

Here an example simplified of my dataframe:

    Code   Weight  Year
1   27009  289     1975
2   27009  300     1975
3   27009  376     1977
4   30010  259     1975
5   30010  501     1979
6   30010  398     1979
[....]

My output should be:

    Code   Weight  Year
1   27009  300     1975
2   27009  376     1977
3   30010  259     1975
4   30010  501     1979
[....]

Between Code and Weight I have got 5 more columns with different values and between Weight and Year one more column with still different values.

Should I use an if statement?

Alex · Accepted Answer · 2016-07-11T16:52:29.233

0

You could use the dplyr package:

df <- read.table(text = "Code   Weight  Year
                         27009  289     1975
                         27009  300     1975
                         27009  376     1977
                         30010  259     1975
                         30010  501     1979
                         30010  398     1979", header = TRUE)

library(dplyr)
df$x <- rnorm(6)

df %>% 
  group_by(Year, Code) %>% 
  slice(which.max(Weight))

#    Code Weight  Year          x
#   (int)  (int) (int)      (dbl)
# 1 27009    300  1975  1.3696332
# 2 30010    259  1975  1.1095553
# 3 27009    376  1977 -1.0672932
# 4 30010    501  1979  0.1152063

As a second solution you coud use the data.table package.

setDT(df)
df[order(-Weight) ,head(.SD,1), keyby = .(Year, Code)]

The results are the same.

edited Jul 11 '16 at 16:52

answered Jul 11 '16 at 16:22

Alex

4,925
2
32
48

hi, thanks a lot. The thing is that with your code I lose all my other columns between Code, Weight and Year. – Jul 11 '16 at 16:31
@user181187 I added another variable. I think that matches the problem you are facing. With slice you can get the desired result. – Alex Jul 11 '16 at 16:36
thanks. very helpful. I am coding everyday in R since a month and there are so many functions that I don't know yet. thanks – Jul 11 '16 at 17:15

score 0 · Answer 2 · answered Jul 11 '16 at 17:05

0

Simply run aggregate in base R using Code and Year as the grouping. This will take max values of all other numeric columns:

finaldf <- aggregate(. ~ Code + Year, df, FUN = max)

answered Jul 11 '16 at 17:05

Parfait

104,375
17
94
125

Remove rows conditionally in a dataframe

2 Answers2