1

I have got a dataframe and I would like to remove some duplicate rows by taking the ones with max values.

Here an example simplified of my dataframe:

    Code   Weight  Year
1   27009  289     1975
2   27009  300     1975
3   27009  376     1977
4   30010  259     1975
5   30010  501     1979
6   30010  398     1979
[....]

My output should be:

    Code   Weight  Year
1   27009  300     1975
2   27009  376     1977
3   30010  259     1975
4   30010  501     1979
[....]

Between Code and Weight I have got 5 more columns with different values and between Weight and Year one more column with still different values.

Should I use an if statement?

Kara
  • 6,115
  • 16
  • 50
  • 57

2 Answers2

0

You could use the dplyr package:

df <- read.table(text = "Code   Weight  Year
                         27009  289     1975
                         27009  300     1975
                         27009  376     1977
                         30010  259     1975
                         30010  501     1979
                         30010  398     1979", header = TRUE)

library(dplyr)
df$x <- rnorm(6)

df %>% 
  group_by(Year, Code) %>% 
  slice(which.max(Weight))

#    Code Weight  Year          x
#   (int)  (int) (int)      (dbl)
# 1 27009    300  1975  1.3696332
# 2 30010    259  1975  1.1095553
# 3 27009    376  1977 -1.0672932
# 4 30010    501  1979  0.1152063

As a second solution you coud use the data.table package.

setDT(df)
df[order(-Weight) ,head(.SD,1), keyby = .(Year, Code)]

The results are the same.

Alex
  • 4,925
  • 2
  • 32
  • 48
  • hi, thanks a lot. The thing is that with your code I lose all my other columns between Code, Weight and Year. –  Jul 11 '16 at 16:31
  • @user181187 I added another variable. I think that matches the problem you are facing. With slice you can get the desired result. – Alex Jul 11 '16 at 16:36
  • thanks. very helpful. I am coding everyday in R since a month and there are so many functions that I don't know yet. thanks –  Jul 11 '16 at 17:15
0

Simply run aggregate in base R using Code and Year as the grouping. This will take max values of all other numeric columns:

finaldf <- aggregate(. ~ Code + Year, df, FUN = max)
Parfait
  • 104,375
  • 17
  • 94
  • 125