Thank you to anyone who could help me with this. I have tried to figure this out for days now without luck. My apologies if the solution was out there but extensive web search did not help.
So I have two datasets df and df2, df1 being my dataset where I have pseudo-duplicates (duplicates if I only consider certain variables) and df2 is my lookup table.
df <- data.frame(
x = c("green", "green", "blue", "orange", "orange"),
y = c("W12", "W12", "W12", "W11", "W12"),
z = c(23, 54, 21, 16, 54)
)
df2 <- data.frame(y=c("W12","W11"), z=c(54, 16))
So, we have:
> df
x y z
1 green W12 23
2 green W12 54
3 blue W12 21
4 orange W11 16
5 orange W12 54
> df2
y z
1 W12 54
2 W11 16
I am looking for a way to not only weed out one of the duplicates based on (x,y), but to be able to tell R which to keep based on the value of z in the look up table. So here, keep the record #2 but not based on its position in the dataset (in my real date, the value of z is sometimes large and other time small, depending on y).
I have tried using !replicate() but cannot find a way to point to the reference table, just to retain either the first record (or the last).
df_dup<-df[c("x", "y")]
df[!duplicated(df_dup),]
I also tried something along the lines of
ddply(df,c("x", "y"),
function(v) {
if (nrow(v)>1) v[which(c(df$y, df$z) %in% c(df2$y, df2$z)), ]
if (nrow(v)==1) v
}
)
df %>%
group_by(x,y) %>%
filter(c(df$y,df$z) %in% c(df2$y,df2$z))
But something funky is happening here, and the %in% does not match the pairs exactly but any combinations of (y,z).
The output I am hoping for is
df
x y z
2 green W12 54
3 blue W12 21
4 orange W11 16
5 orange W12 54
But with Row#2 chosen not because it is the last row but because it matches the lookup table. In my longer dataset, the rows to keep might end up being the first or the second.
Thank you in advance again to anyone who can find a way to do this in R. Ultimately, I will need to do this on a gigantic dataset and with several variables as grouping variables with only one of them being part of the lookup table.