Deleting rows that are duplicated in one column based on the conditions of another column

Question

Here is an example of my data set;

Date      Time(GMT)Depth Temp  Salinity Density Phosphate
24/06/2002  1000    1           33.855          0.01
24/06/2002  1000    45          33.827          0.01
01/07/2002  1000    10  13.26   33.104  24.873  0.06
01/07/2002  1000    30  12.01   33.787  25.646  0.13
08/07/2002  1000    5   13.34   33.609  25.248  0.01
08/07/2002  1000    40  12.01   34.258  26.011  1.33
15/07/2002  1000    30  12.04   34.507  26.199  0.01
22/07/2002  1000    5   13.93   33.792  25.269  0.01
22/07/2002  1000    30  11.9    34.438  26.172  0.08
29/07/2002  1000    5   13.23   34.09   25.642  0.01

I want to delete duplicate rows so that I only have one row per date, I want to do this based on the Depth, I would like to keep the row with the greatest (deepest) depth. Any ideas?

score 92 · Accepted Answer · answered Jun 03 '14 at 09:25

92

Lets say you have data in df

df = df[order(df[,'Date'],-df[,'Depth']),]
df = df[!duplicated(df$Date),]

answered Jun 03 '14 at 09:25

vrajs5

4,066
1
27
44

1

Elegant, much better than my one. +1 – Oleg Sklyar Jun 03 '14 at 09:28
1

You should probably switch `duplicated[df$Date]` to `duplicated(df$Date)`. Other than that, this is the only real R solution here – David Arenburg Jun 03 '14 at 09:28

score 18 · Answer 2 · answered Dec 30 '20 at 14:35

18

Here's one way to do it in a single dplyr call:

# Remove any duplicates
df <- df %>%
  arrange(Date, -Depth) %>%
  filter(duplicated(Date) == FALSE)

answered Dec 30 '20 at 14:35

Ryan Bradley

627
6
9

1

Nice. I would prefer `!duplicated(Date)` instead of `duplicated(Date) == FALSE`. – Stefan Jelkovich Sep 08 '21 at 09:21

score 11 · Answer 3 · edited Jun 03 '14 at 10:37

11

Introducing a data.table solution which will be the fastest way to solve this (assuming data is your data set)

library(data.table)
unique(setDT(data)[order(Date, -Depth)], by = "Date")

Just another way:

setDT(data)[data[, .I[which.max(Depth)], by=Date]$V1]

edited Jun 03 '14 at 10:37

Arun

116,683
26
284
387

answered Jun 03 '14 at 09:35

David Arenburg

91,361
17
137
196

score 4 · Answer 4 · answered Jun 03 '14 at 09:16

4

This might be not the fastest approach if your data frame is large, but a fairly strightforward one. This might change the order of your data frame and you might need to reorder by e.g. date afterwards. Instead of deleting we split the data by date, in each chunk pick a row with the maximum date and finally join the result back into a data frame

data = split(data, data$Date)
data = lapply(data, function(x) x[which.max(x$Depth), , drop=FALSE])
data = do.call("rbind", data)

answered Jun 03 '14 at 09:16

Oleg Sklyar

9,834
6
39
62

maybe you can replace `split+lapply` by `by`. – agstudy Jun 03 '14 at 09:24
I was writing a similar answer to yours, but you beat me to it. I've added helen.h's actual data to your post so the desired output is shown. I hope that's alright. – Anders Ellern Bilgrau Jun 03 '14 at 09:25
@agstudy I'd love to, but do not have R installed where I am now to try it out :) – Oleg Sklyar Jun 03 '14 at 09:25

score 4 · Answer 5 · answered Nov 27 '18 at 19:47

4

You might also use dplyr's arrange() instead of order (I find it more intuitive):

df <- arrange(df, Date, -Depth)
df <- df[!duplicated(df$Date),]

answered Nov 27 '18 at 19:47

Pablo Herreros Cantis

529
8
23

score 3 · Answer 6 · answered Jun 03 '14 at 09:23

# First find the maxvalues
maxvals = aggregate(df$Depth~df$Date, FUN=max)
#Now use apply to find the matching rows and separate them out
out = df[apply(maxvals,1,FUN=function(x) which(paste(df$Date,df$Depth) == paste(x[1],x[2]))),]

Does that work for you?

score 1 · Answer 7 · answered Jan 29 '23 at 13:22

1

An alternative approach using dplyr's distinct function:

library(dplyr)
df %>% distinct(column, .keep_all = TRUE)

answered Jan 29 '23 at 13:22

Mumtaj Ali

421
4
7

Deleting rows that are duplicated in one column based on the conditions of another column

7 Answers7

Linked