58

Here is an example of my data set;

Date      Time(GMT)Depth Temp  Salinity Density Phosphate
24/06/2002  1000    1           33.855          0.01
24/06/2002  1000    45          33.827          0.01
01/07/2002  1000    10  13.26   33.104  24.873  0.06
01/07/2002  1000    30  12.01   33.787  25.646  0.13
08/07/2002  1000    5   13.34   33.609  25.248  0.01
08/07/2002  1000    40  12.01   34.258  26.011  1.33
15/07/2002  1000    30  12.04   34.507  26.199  0.01
22/07/2002  1000    5   13.93   33.792  25.269  0.01
22/07/2002  1000    30  11.9    34.438  26.172  0.08
29/07/2002  1000    5   13.23   34.09   25.642  0.01

I want to delete duplicate rows so that I only have one row per date, I want to do this based on the Depth, I would like to keep the row with the greatest (deepest) depth. Any ideas?

helen.h
  • 943
  • 2
  • 7
  • 18

7 Answers7

92

Lets say you have data in df

df = df[order(df[,'Date'],-df[,'Depth']),]
df = df[!duplicated(df$Date),]
vrajs5
  • 4,066
  • 1
  • 27
  • 44
18

Here's one way to do it in a single dplyr call:

# Remove any duplicates
df <- df %>%
  arrange(Date, -Depth) %>%
  filter(duplicated(Date) == FALSE)
Ryan Bradley
  • 627
  • 6
  • 9
11

Introducing a data.table solution which will be the fastest way to solve this (assuming data is your data set)

library(data.table)
unique(setDT(data)[order(Date, -Depth)], by = "Date")

Just another way:

setDT(data)[data[, .I[which.max(Depth)], by=Date]$V1]
Arun
  • 116,683
  • 26
  • 284
  • 387
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
4

This might be not the fastest approach if your data frame is large, but a fairly strightforward one. This might change the order of your data frame and you might need to reorder by e.g. date afterwards. Instead of deleting we split the data by date, in each chunk pick a row with the maximum date and finally join the result back into a data frame

data = split(data, data$Date)
data = lapply(data, function(x) x[which.max(x$Depth), , drop=FALSE])
data = do.call("rbind", data)
Oleg Sklyar
  • 9,834
  • 6
  • 39
  • 62
4

You might also use dplyr's arrange() instead of order (I find it more intuitive):

df <- arrange(df, Date, -Depth)
df <- df[!duplicated(df$Date),]
3
# First find the maxvalues
maxvals = aggregate(df$Depth~df$Date, FUN=max)
#Now use apply to find the matching rows and separate them out
out = df[apply(maxvals,1,FUN=function(x) which(paste(df$Date,df$Depth) == paste(x[1],x[2]))),]

Does that work for you?

ThatGuy
  • 1,225
  • 10
  • 28
1

An alternative approach using dplyr's distinct function:

library(dplyr)
df %>% distinct(column, .keep_all = TRUE)
Mumtaj Ali
  • 421
  • 4
  • 7