
Thank you to anyone who can help me with this. I have tried to figure this out for days without luck. My apologies if the solution is already out there; extensive web searching did not turn it up.

So I have two datasets, df and df2: df is my dataset with pseudo-duplicates (rows that are duplicates if I only consider certain variables) and df2 is my lookup table.

df <- data.frame(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
  )
df2 <- data.frame(y=c("W12","W11"), z=c(54, 16))

So, we have:

> df
       x   y  z
1  green W12 23
2  green W12 54
3   blue W12 21
4 orange W11 16
5 orange W12 54

> df2
     y  z
 1 W12 54
 2 W11 16

I am looking for a way not only to weed out one of the duplicates based on (x, y), but to tell R which one to keep based on the value of z in the lookup table. So here, keep record #2, but not because of its position in the dataset (in my real data, the value of z to keep is sometimes large and other times small, depending on y).

I have tried using !duplicated() but cannot find a way to point it to the reference table, only to retain either the first record (or the last).

df_dup<-df[c("x", "y")]
df[!duplicated(df_dup),]

I also tried something along the lines of

ddply(df, c("x", "y"),
      function(v) {
        if (nrow(v) > 1) v[which(c(df$y, df$z) %in% c(df2$y, df2$z)), ]
        if (nrow(v) == 1) v
      })
df %>% 
  group_by(x,y) %>% 
  filter(c(df$y,df$z) %in% c(df2$y,df2$z))

But something funky is happening here: %in% does not match the (y, z) pairs exactly, only any combination of the individual values.
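Here is a minimal illustration of what I think is going wrong (the paste() version below is only to show exact-pair matching, not my full solution):

```r
df <- data.frame(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
)
df2 <- data.frame(y = c("W12", "W11"), z = c(54, 16))

# c() flattens both columns into one long vector, so %in% tests each
# element against the pooled y- and z-values -- the pairing is lost:
length(c(df$y, df$z) %in% c(df2$y, df2$z))
# [1] 10   (one test per element, not one per row)

# Pasting the columns keeps each (y, z) pair as a unit, giving one
# logical per row that is TRUE only for exact pairs:
paste(df$y, df$z) %in% paste(df2$y, df2$z)
# [1] FALSE  TRUE FALSE  TRUE  TRUE
```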

The output I am hoping for is

 df
       x   y  z
2  green W12 54
3   blue W12 21
4 orange W11 16
5 orange W12 54

But with row #2 chosen not because of its position but because it matches the lookup table. In my longer dataset, the row to keep might be the first or the second of a duplicate pair.

Thank you in advance again to anyone who can find a way to do this in R. Ultimately, I will need to do this on a gigantic dataset with several grouping variables, only one of which appears in the lookup table.

Marie T.
  • If you are keeping it based on what's in df2, that sounds a lot like keeping the row with 54 (since it appears on df2). You could make it clearer by showing your desired output. – Frank Sep 23 '16 at 19:23
  • Yes please add the output you expect to make it easier to understand. When I had to find unique values across multiple columns one hack I have used is to create a single column with the values concatenated and compare that. – Rohit Das Sep 23 '16 at 19:27
  • Frank, I am sorry, yes, i want to keep row #2. I ended changing the order when I posted the code. – Marie T. Sep 23 '16 at 19:38

2 Answers


I might do...

library(data.table)
setDT(df); setDT(df2)

ord = +is.na(df2[df, on=c("y", "z"), which=TRUE])
unique(df[ order(ord) ], by=c("x","y"))

        x   y  z
1:  green W12 54
2: orange W11 16
3: orange W12 54
4:   blue W12 21

This prioritizes rows with matches in df2; but if you want to do the opposite (as it looked like in an earlier version of the question), just put a - in the definition of ord instead of a +.


How it works:

X[Y, on, which=TRUE] returns, for each row of Y, the row(s) of X that are matched. If there are multiple matches, they are all returned (but in your lookup table, there's no reason to have repeats). If there is no match, a missing value is returned.

+is.na(w), where w is the vector of matched row numbers, returns a vector we can sort by:

  • 1 where the element of w is missing (no match in df2)
  • 0 otherwise

unique(Y[order(ord)], by) sorts Y by our vector and then drops duplicates as usual, keeping the first observation per group. You could alternatively do Y[order(ord), .SD[1L], by] for this step.
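Traced on the question's data, the intermediate values are:

```r
library(data.table)

df <- data.table(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
)
df2 <- data.table(y = c("W12", "W11"), z = c(54, 16))

# for each row of df, the matching row number in df2 (NA = no match)
df2[df, on = c("y", "z"), which = TRUE]
# [1] NA  1 NA  2  1

# matched rows get 0, unmatched rows get 1
ord = +is.na(df2[df, on = c("y", "z"), which = TRUE])
# [1] 1 0 1 0 0

# order(ord) moves the matched rows (2, 4, 5) to the front, so unique()
# prefers them over their unmatched (x, y)-duplicates
unique(df[order(ord)], by = c("x", "y"))
```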

Frank
  • Thank you Frank! It works great! I tried your code on another dummy record a little more complicated and then let it rip through my dataset of 1M records and it did its job flawlessly. Again, thank you. – Marie T. Oct 04 '16 at 21:43

One approach is the following:

  1. Find all the rows that have duplicates for x and y in df. For this, we use Sven Hohenstein's answer found here:

    dup.ind <- which(duplicated(df[,c("x","y")]) | duplicated(df[,c("x","y")], fromLast = TRUE))
    
  2. We also want to keep all other rows (that do not have duplicates) in the result so we use setdiff to identify those:

    other.ind <- setdiff(seq_len(nrow(df)), dup.ind)
    
  3. From dup.ind keep only those for which the z value in df is equal to that in df2 for the matching y values. Here, df2$z[match(df$y[dup.ind], df2$y)] looks up the z value in df2 for each dup.ind:

    keep.ind <- dup.ind[df$z[dup.ind] == df2$z[match(df$y[dup.ind], df2$y)]]
    
  4. Subset the original df using c(keep.ind,other.ind). Here, we sort these to maintain the original order (but that is not necessary):

    result <- df[sort(c(keep.ind, other.ind)),]
    

Using your input data, the result is:

print(result)
##       x   y  z
##2  green W12 54
##3   blue W12 21
##4 orange W11 16
##5 orange W12 54
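For reference, the four steps assembled into one runnable base-R script (using the question's data):

```r
df <- data.frame(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
)
df2 <- data.frame(y = c("W12", "W11"), z = c(54, 16))

# 1. rows that have an (x, y)-duplicate somewhere in df
dup.ind <- which(duplicated(df[, c("x", "y")]) |
                 duplicated(df[, c("x", "y")], fromLast = TRUE))

# 2. rows with no duplicate are kept unconditionally
other.ind <- setdiff(seq_len(nrow(df)), dup.ind)

# 3. among the duplicates, keep those whose z matches df2's z for the same y
keep.ind <- dup.ind[df$z[dup.ind] == df2$z[match(df$y[dup.ind], df2$y)]]

# 4. subset in the original row order
result <- df[sort(c(keep.ind, other.ind)), ]
```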
aichao