In your problem, the objective is to obtain the combinations of numbers in the vector such that the difference of the pair is greater than 1, but you don't want to make unnecessary comparisons that waste processing time. While I like Ritchie Sacramento's answer for contiguous sequences, I wanted to provide an answer that can handle gaps in the sequence, unsorted sequences, and non-integer sequences. The only way I can think to make this more 'efficient' is to iterate over the sequence and gather all the numbers that are at least 1 greater than the current iterand.
In my approach, I first enforce the sequence is sorted. Then I loop over each element in the sequence and compare only the indices after the current one. If these meet the difference criteria, I add them to a running list of pairs. This assumes you don't have an repeated numbers, otherwise you might want to first run unique()
.
#' Find pairs in a sequence with a difference greater than a specified offset.
#'
#' This function sorts the input sequence, and then iterates through the sorted sequence
#' to find pairs of numbers with a difference greater than the specified difference offset.
#' Pairs are returned in a matrix where pair[1] > pair[2].
#'
#' @param x A numeric vector representing the input sequence.
#' @param differenceOffset A numeric value representing the minimum difference between pairs. Default is 1.
#' @return A matrix where each row represents a pair with a difference greater than the difference offset.
#' @examples
#' find_pairs(c(1,2,3,4,5)) # uses default differenceOffset of 1
#' find_pairs(c(2,4,6,8,10), 2)
find_pairs <- function(x, differenceOffset = 1) {
# sort the sequence
x_sorted <- sort(x)
len <- length(x_sorted)
# initialize a list to store the pairs
pairs <- list()
# iterate through the sorted sequence
for(current in 1:(len - 1)) {
# iterate and test the next index forward of the current
for(test in (current + 1):len) {
# if the difference is > differenceOffset, add the pair to the list
if(x_sorted[test] - x_sorted[current] > differenceOffset) {
# flip the order to put pair[1] > pair[2] in the output
pairs <- c(pairs, list(c(x_sorted[test], x_sorted[current])))
}
}
}
# convert the list to a matrix
do.call(rbind, pairs)
}
Here's a couple examples of the usage.
# unsorted example
> find_pairs(c(4,3,2,5,1))
[,1] [,2]
[1,] 3 1
[2,] 4 1
[3,] 5 1
[4,] 4 2
[5,] 5 2
[6,] 5 3
# use difference offset of >2
> find_pairs(c(1,2,3,4,5), 2)
[,1] [,2]
[1,] 4 1
[2,] 5 1
[3,] 5 2
# your example
> find_pairs(2010:2020)
[,1] [,2]
[1,] 2012 2010
[2,] 2013 2010
[3,] 2014 2010
[4,] 2015 2010
[5,] 2016 2010
[6,] 2017 2010
[7,] 2018 2010
[8,] 2019 2010
[9,] 2020 2010
[10,] 2013 2011
...
[40,] 2018 2016
[41,] 2019 2016
[42,] 2020 2016
[43,] 2019 2017
[44,] 2020 2017
[45,] 2020 2018
# gap example
> find_pairs(c(1:2, 6:8))
[,1] [,2]
[1,] 6 1
[2,] 7 1
[3,] 8 1
[4,] 6 2
[5,] 7 2
[6,] 8 2
[7,] 8 6
# non-integer example
> find_pairs(seq(1, 3, by = 0.25), 1)
[,1] [,2]
[1,] 2.25 1.00
[2,] 2.50 1.00
[3,] 2.75 1.00
[4,] 3.00 1.00
[5,] 2.50 1.25
[6,] 2.75 1.25
[7,] 3.00 1.25
[8,] 2.75 1.50
[9,] 3.00 1.50
[10,] 3.00 1.75
EDIT
In terms of time complexity, the solution above might have a total time complexity [O(n log n + n)] less than your solution using expand.grid
[O(n^2)] for long vectors.
jblood94's answer with benchmarking showed the above approach was not really efficient at all. In the original function, find_pairs
, I attempted to spell the process out as explicitly as possible, which led to considerable computational overhead. It is often a fun challenge to write complicated one-liners that are the fast and flashy, but this is almost always at the cost of being almost entirely illegible post hoc. Of course, using a package like data.table
that is designed specifically for optimizing this kind of problem would be the best solution. But, maintaining the same integrity of clear and concise code, and using base R, I submit a revision below in find_pairs.2
.
In this version I follow a similar thread, but I make a few optimizations. First, I allocate a large matrix for the pairs rather than an empty list. Though this is less memory efficient, it appears to affect the timing. Second, I utilize a while-loop to find the the first index whose value exceeds the differenceOffset
. Once we find this index, and because we have sorted the array, we know all values including and beyond it must be greater than differenceOffset
. I then simply set the current value (through scalar expansion) alongside the the values larger than our test condition directly into the matrix into the next available positions (determined by a pair_count
er). Finally, I return trimmed matrix.
find_pairs.2 <- function(x, differenceOffset = 1) {
# sort the sequence
x <- sort(x)
len <- length(x)
# initialize a matrix to store the pairs
# assume that in the worst case, we have len*(len-1)/2 pairs
max_pairs <- len * (len - 1) / 2
pairs <- matrix(nrow = max_pairs, ncol = 2)
# initialize a counter for the number of pairs found
pair_count <- 0
# iterate through the sorted sequence
for (current in 1:(len - 1)) {
# find the first index (test) that is strictly greater than differenceOffset
test <- current + 1
while (test <= len && x[test] - x[current] <= differenceOffset) {
test <- test + 1
}
# if test is within bounds, add all pairs from test to len to the matrix
if (test <= len) {
num_new_pairs <- len - test + 1
pairs[(pair_count + 1):(pair_count + num_new_pairs), 1] <- x[test:len]
pairs[(pair_count + 1):(pair_count + num_new_pairs), 2] <- x[current]
pair_count <- pair_count + num_new_pairs
}
}
# trim unused slots in the matrix
return(pairs[1:pair_count, ])
}
If you really want to speed up your calculations, you can use the Rcpp
package and compile a C++ version of the new solution (benchmarked below).
find_pairs.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix find_pairs_cpp(NumericVector x, int differenceOffset = 1) {
// Sort the sequence
std::sort(x.begin(), x.end());
int len = x.size();
// Initialize a matrix to store the pairs
// Assume that in the worst case, we have len*(len-1)/2 pairs
int max_pairs = len * (len - 1) / 2;
NumericMatrix pairs(max_pairs, 2);
// Initialize a counter for the number of pairs found
int pair_count = 0;
// Iterate through the sorted sequence
for (int current = 0; current < len - 1; ++current) {
// Find the first index (test) that is strictly greater than differenceOffset
int test = current + 1;
while (test < len && x[test] - x[current] <= differenceOffset) {
++test;
}
// If test is within bounds, add all pairs from test to len to the matrix
if (test < len) {
int num_new_pairs = len - test;
for (int j = 0; j < num_new_pairs; ++j) {
pairs(pair_count + j, 0) = x[test + j];
pairs(pair_count + j, 1) = x[current];
}
pair_count += num_new_pairs;
}
}
// Trim unused slots in the matrix
NumericMatrix result = pairs(Range(0, pair_count - 1), _);
return result;
}
Then, source the C++ function like so:
# Load Rcpp package
library(Rcpp)
# Source the C++ script
sourceCpp("find_pairs.cpp")
# Now you can use your C++ function in R
x <- 2010:2020
find_pairs_cpp(x)
Benchmarking
As a side note. I benchmarked this against the functions defined in jblood94's answer and see remarkable improvement, especially since the original find_pairs
function didn't even survive in the larger test!
# "small" vector
x <- runif(1e2, 0, 10)
bm_small <- microbenchmark::microbenchmark(
expand.grid = nrow(f0(x)),
find_pairs.2 = nrow(find_pairs.2(x)),
find_pairs_cpp = nrow(find_pairs_cpp(x)),
f1 = nrow(f1(x)),
f2 = nrow(f2(x)),
f3 = nrow(f3(x)),
f4 = nrow(f4(x)),
check = "equal",
times = 1000
)
> print(bm_small)
Unit: microseconds
expr min lq mean median uq max neval
expand.grid 776.5 827.80 878.1540 853.60 895.10 1559.9 1000
find_pairs.2 394.3 432.10 462.8021 449.40 476.60 1042.6 1000
find_pairs_cpp 28.1 39.60 48.8424 43.30 48.20 138.5 1000
f1 769.2 823.30 903.5507 847.10 885.60 21644.9 1000
f2 657.8 700.10 832.8963 721.30 748.95 32129.7 1000
f3 3387.0 3508.05 3768.5233 3565.00 3655.65 50499.4 1000
f4 2511.5 2608.00 2770.2623 2650.95 2720.60 26553.3 1000
x <- runif(1e4, 0, 10)
bm_large <- microbenchmark::microbenchmark(
expand.grid = nrow(f0(x)),
find_pairs.2 = nrow(find_pairs.2(x)), # runs for several minutes without completing
f1 = nrow(f1(x)),
f2 = nrow(f2(x)),
f3 = nrow(f3(x)),
f4 = nrow(f4(x)),
check = "equal",
times = 10
)
> print(bm_large)
Unit: milliseconds
expr min lq mean median uq max neval
expand.grid 5344.5752 5551.6006 5640.7089 5689.3884 5750.9637 5767.5464 10
find_pairs.2 2531.3152 2574.1561 2679.6633 2643.0467 2788.3917 2879.4604 10
find_pairs_cpp 405.0732 458.4121 505.8300 494.3591 531.5002 640.3842 10
f1 2933.1794 2973.3883 3061.8112 3063.7269 3151.2178 3226.2590 10
f2 906.5848 927.4680 979.4814 943.6236 962.3983 1177.1214 10
f3 654.4493 699.1284 798.1042 787.8151 905.0556 958.2739 10
f4 946.0565 1006.1900 1108.1919 1122.6784 1145.9541 1301.9947 10
Cheers!