I have four large vectors of unequal length. Here is a toy dataset similar to my real data:

a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)

Is there a way to compare all of these vectors one by one with an allowed threshold of ±0.5 for the match? In other words, I want to report the numbers that are common among all four vectors while allowing a drift of 0.5.

In the case of the toy dataset above, the final answer is:

    Match1
a 1021.923
b 1021.900
c 1021.929
d 1021.950

I understand that this is possible for two vectors, but how can I do it for 4 vectors?

paropunam
  • Write a function to do it for 2 vectors, call it `my_compare`, then `Reduce(my_compare, list(a, b, c, d))` – Gregor Thomas Apr 25 '19 at 14:54
  • Though, there are potential issues, e.g., if `a` has `100`, `b` has `100.4`, and `c` has `99.6`, depending how you want that handled. Running on different orderings or making `my_compare` accept and return lists of equivalencies could work. – Gregor Thomas Apr 25 '19 at 14:58
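
The `Reduce` idea from the comments could look roughly like the sketch below. It is only one possible reading: `my_compare` is assumed to keep the values of its first argument that have at least one partner within `tol` in its second argument, so every comparison is anchored on the values of `a`. The names `tol`, `vecs`, `common` and `matches` are made up for illustration.

```r
# Hypothetical helper: keep the values of x that have at least one
# value of y within +/- tol
my_compare <- function(x, y, tol = 0.5) {
  # outer() builds the length(x)-by-length(y) matrix of differences
  close <- abs(outer(x, y, "-")) <= tol
  x[rowSums(close) > 0]
}

vecs <- list(
  a = c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1),
  b = c(21, 32, 523, 123.1, 123.4545, 12345, 95.434, 879.25, 1021.9, 11, 12, 662),
  c = c(52, 21, 1021.9288, 12019.12, 879.1),
  d = c(432.432, 23466.3, 45435, 3456, 123, 6688, 1021.95)
)

# values of a that survive the comparison with b, then c, then d
common <- Reduce(my_compare, vecs)

# for each vector, the values lying within tol of a surviving value of a
matches <- lapply(vecs, my_compare, y = common)
```

Because everything is measured against `a`, a case such as `100` in `a`, `100.4` in `b` and `99.6` in `c` would be reported as a match even though `100.4` and `99.6` are 0.8 apart. Running `Reduce` on a different ordering of the list anchors the comparison on a different vector.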

1 Answer

Here is a data.table solution.

It scales to n vectors, so feed it as many as you like. It also performs well when multiple values have 'hits' in all vectors.

sample data

a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)

code

library(data.table)

#create list with vectors
l <- list( a,b,c,d )
names(l) <- letters[1:4]
#create data.table to work with
DT <- rbindlist( lapply(l, function(x) {data.table( value = x)} ), idcol = "group")
#add margins to each value
DT[, `:=`( id = 1:.N, start = value - 0.5, end = value + 0.5 ) ]
#set keys for joining
setkey(DT, start, end)
#perform overlap-join
result <- foverlaps(DT,DT)

#cast, to check which 'hits' each value has in each group (a,b,c,d)
answer <- dcast( result, 
             group + value ~ i.group, 
             fun.aggregate = function(x){ x * 1 }, 
             value.var = "i.value", 
             fill = NA )

#get your final answer
#set columns to look at (i.e. the names from the earlier created list)
cols <- names(l)
#keep the rows without NA (rowSums works here because TRUE = 1, FALSE = 0)
#so if rowSums == 0, the columns named in the vector 'cols' contain no NA
answer[ rowSums( is.na( answer[ , ..cols ] ) ) == 0, ]

output

#    group    value        a      b        c       d
# 1:     a 1021.923 1021.923 1021.9 1021.929 1021.95
# 2:     b 1021.900 1021.923 1021.9 1021.929 1021.95
# 3:     c 1021.929 1021.923 1021.9 1021.929 1021.95
# 4:     d 1021.950 1021.923 1021.9 1021.929 1021.95
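
One thing worth checking against your own definition of "drift": an overlap join matches two values whenever their intervals touch, and two intervals of half-width 0.5 overlap as long as the values differ by at most 1.0. The base R sketch below illustrates this with two made-up values (`v`, `w`); the variable names and the idea of halving the margin are my own assumptions, not part of the code above.

```r
tol <- 0.5
v <- 100
w <- 100.8   # 0.8 away from v, i.e. further than tol

# closed intervals [v - tol, v + tol] and [w - tol, w + tol] overlap
# when each one starts before the other one ends
intervals_overlap <- (v - tol) <= (w + tol) && (w - tol) <= (v + tol)

# direct check of the +/- tol criterion
within_tol <- abs(v - w) <= tol

# with a margin of tol / 2 on each side, the overlap test agrees
# with the direct check
half <- tol / 2
half_overlap <- (v - half) <= (w + half) && (w - half) <= (v + half)

intervals_overlap  # TRUE
within_tol         # FALSE
half_overlap       # FALSE
```

If matched values must lie within 0.5 of each other, using `value - 0.25` and `value + 0.25` as `start` and `end` would be one way to get that behaviour from the overlap join.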
Wimpel