R setdiff() by regex

Question

Is it possible to customize setdiff using regular expressions to see what is in one vector and not another? For example:

x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0", "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")

[1] "1\t119\t120\t1\t119\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0"       
[3] "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0"   
[4] "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0"   

y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0", "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0", "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0", "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")

[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"       
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0"   
[4] "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"   
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"  
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"

I want to be able to show that:

[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"  
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"

are new because the leading fields 4\t157\t158 and 5\t158\t159 are unique to y. This doesn't work:

> setdiff(y,x)
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0" "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0" "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"

This returns every element of y, because column 5 differs between x and y. I want to setdiff based only on the first three columns.

A simple example of setdiff can be found here: How to tell what is in one vector and not another?

Is the setdiff supposed to be run as a by-row or as a by-dataframe test? I'm assuming that the "\t" is an expected separator. — IRTFM, Feb 06 '16 at 02:31

MichaelChirico · Accepted Answer · 2016-02-10T03:22:17.670

One way to do this is to convert x and y to data.frames and anti-join on the first three columns. I'll use data.table since I find it more natural.

library(data.table)
xDT <- as.data.table(do.call("rbind", strsplit(x, split = "\t")))
yDT <- as.data.table(do.call("rbind", strsplit(y, split = "\t")))

Now anti-join (a "setdiff" for data.frames/data.tables):

yDT[!xDT, on = paste0("V", 1:3)]
#    V1  V2  V3 V4  V5  V6  V7  V8 V9
# 1:  4 157 158  4 147 158 XWX YTY  0
# 2:  5 158 159  5 148 159 PHP WZW  0

You could also get the row index (thanks to @Frank for the suggested improvement/simplification):

> yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]

Or extract it directly from y:

> y[yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]]
# [1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"

@Frank good call! `which` is one of those `data.table` arguments I've never found a chance to use so it doesn't come to mind. Duly noted. — MichaelChirico, Feb 10 '16 at 03:23

akrun · Answer 2 · 2016-02-06T04:57:20.160

We could also use anti_join from dplyr after reading the strings with fread:

library(data.table)
library(dplyr)
anti_join(fread(paste(y, collapse='\n')), 
        fread(paste(x, collapse='\n')), by = c('V1', 'V2', 'V3'))

#      V1    V2    V3    V4    V5    V6    V7    V8    V9
#    (int) (int) (int) (int) (int) (int) (chr) (chr) (int)
# 1     4   157   158     4   147   158   XWX   YTY     0
# 2     5   158   159     5   148   159   PHP   WZW     0

Or, since the title asks for regex, we can use sub to keep only the first three tab-separated fields of each string and then compare the results with %in%:

y[!sub('(([^\t]+\t){3}).*', '\\1', y) %in% 
     sub('(([^\t]+\t){3}).*', '\\1', x)]
#[1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
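
If the number of key fields might change, the same idea can be wrapped in a small base R helper. This is only a sketch: setdiff_by_fields, its arguments n and sep, and the shortened x_demo / y_demo vectors are names made up here for illustration, and it splits on the separator with strsplit instead of using a regex.

```r
# Hypothetical helper (not part of base R or any package): return the elements
# of `a` whose first `n` sep-separated fields do not occur as the first `n`
# fields of any element of `b`.
setdiff_by_fields <- function(a, b, n = 3, sep = "\t") {
  # Build a comparison key from the first n fields of each string
  key <- function(s) {
    vapply(strsplit(s, sep, fixed = TRUE),
           function(f) paste(head(f, n), collapse = sep),
           character(1))
  }
  a[!key(a) %in% key(b)]
}

# Shortened versions of the vectors from the question
x_demo <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0",
            "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0")
y_demo <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0",
            "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0",
            "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0")

setdiff_by_fields(y_demo, x_demo)
# [1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"
```

Unlike setdiff, this keeps duplicates in the first argument, just like the %in% version above.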
