0

Is it possible to customize setdiff using regular expressions to see what is in one vector and not another? For example:

x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0", "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")

[1] "1\t119\t120\t1\t119\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0"       
[3] "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0"   
[4] "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0"   

y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0", "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0", "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0", "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")

[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"       
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0"   
[4] "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"   
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"  
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0" 

I want to be able to show that:

[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"  
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0" 

are new because their first three columns, 4\t157\t158 and 5\t158\t159, appear only in y. A plain setdiff doesn't work:

> setdiff(y,x)
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0" "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0" "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"  

This returns every element of y, because column 5 differs between x and y, so no two full strings match. I want to setdiff based only on the first three columns.

A simple example of setdiff can be found here: How to tell what is in one vector and not another?

warship

2 Answers

4

One way to do this is to convert x and y to data.frames and anti-join them. I'll use data.table since I find it more natural.

library(data.table)
xDT <- as.data.table(do.call("rbind", strsplit(x, split = "\t")))
yDT <- as.data.table(do.call("rbind", strsplit(y, split = "\t")))

Now anti-join (a "setdiff" for data.frames/data.tables):

yDT[!xDT, on = paste0("V", 1:3)]
#    V1  V2  V3 V4  V5  V6  V7  V8 V9
# 1:  4 157 158  4 147 158 XWX YTY  0
# 2:  5 158 159  5 148 159 PHP WZW  0

You could also get the row index (thanks to @Frank for the suggested improvement/simplification):

yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]
# [1] 5 6

Or extract it directly from y:

y[yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]]
# [1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
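
For readers without data.table, the same anti-join idea can be sketched in base R. This is not part of the original answer: the helper name `key` and the variables `xDF`, `yDF`, and `new_rows` are made up for illustration, and reading every column as character is an assumption made so that keys compare as plain strings.

```r
# Data from the question: tab-separated records.
x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0",
       "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0",
       "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0",
       "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")
y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0",
       "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0",
       "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0",
       "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0",
       "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0",
       "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")

# Parse each vector into a data.frame with default column names V1..V9.
xDF <- read.delim(text = x, header = FALSE, colClasses = "character")
yDF <- read.delim(text = y, header = FALSE, colClasses = "character")

# Hypothetical helper: one key string per row, built from the first three columns.
key <- function(d) do.call(paste, c(d[1:3], sep = "\t"))

# Keep the rows of yDF whose key does not occur in xDF (an anti-join).
new_rows <- yDF[!key(yDF) %in% key(xDF), ]
new_rows
```

Here `new_rows` holds rows 5 and 6 of `yDF`, and `y[!key(yDF) %in% key(xDF)]` would return the original strings instead.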
MichaelChirico
    @Frank good call! `which` is one of those `data.table` arguments I've never found a chance to use so it doesn't come to mind. Duly noted. – MichaelChirico Feb 10 '16 at 03:23
3

We could also use anti_join from dplyr after reading the strings with fread:

library(data.table)
library(dplyr)
anti_join(fread(paste(y, collapse='\n')), 
        fread(paste(x, collapse='\n')), by = c('V1', 'V2', 'V3'))

#      V1    V2    V3    V4    V5    V6    V7    V8    V9
#    (int) (int) (int) (int) (int) (int) (chr) (chr) (int)
# 1     4   157   158     4   147   158   XWX   YTY     0
# 2     5   158   159     5   148   159   PHP   WZW     0

Or, since the title asks for regex, we can use sub to keep only the first three tab-separated fields of each string and then compare them with %in%:

y[!sub('(([^\t]+\t){3}).*', '\\1', y) %in% 
     sub('(([^\t]+\t){3}).*', '\\1', x)]
#[1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
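
If the number of leading columns might change, the regex approach can be wrapped in a small function. The following is only a sketch, not part of the original answer: `key_prefix` and its parameter `n` are hypothetical names, and the pattern is a slightly modified variant (anchored with `^`, and using `*` so that empty fields are allowed) rather than the exact regex above.

```r
# Data from the question: tab-separated records.
x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0",
       "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0",
       "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0",
       "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")
y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0",
       "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0",
       "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0",
       "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0",
       "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0",
       "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")

# Hypothetical helper: keep the first n tab-terminated fields of each string.
key_prefix <- function(s, n = 3) {
  # Build e.g. "^((?:[^\t]*\t){3}).*$" and keep only the captured prefix.
  pattern <- sprintf("^((?:[^\t]*\t){%d}).*$", n)
  sub(pattern, "\\1", s, perl = TRUE)
}

# Elements of y whose first three fields do not occur in x.
only_in_y <- y[!key_prefix(y) %in% key_prefix(x)]
only_in_y
```

With `n = 3` this selects the fifth and sixth elements of `y`. Passing `n = 6` would also compare column 5, in which case every element of `y` is returned, just as with plain setdiff.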
akrun