Output of pmatch changes with vector length

Question

I am trying to use pmatch in base R. The following example appears to work as expected:

treat1   <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
              2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
              5, 5, 5, 6, 6, 6, 6, 6, 6, 7,
              7, 7, 7, 7, 7, 7, 8, 8, 8, 8,
              8, 8, 9, 9, 9, 9, 9, 9, 9,10,
             10,10,10,10,10,10,11,11,11,11,
             11,11,12,12,12,12,12,12,12,13,
             13,13,13,13,13,14,14,14,14,14,
             14,14,15,15,15,15,15,15,16,16,
             16,16,16,16,16,17,17,17,17,17,
             17,18,18,18,18,18,18,18)

control1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
              2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
              4, 4, 5, 5, 5, 5, 5, 6, 6, 6,
              6, 6, 6, 7, 7, 7, 7, 7, 8, 8,
              8, 8, 8, 8, 9, 9, 9, 9, 9,10,
             10,10,10,10,10,11,11,11,11,11,
             12,12,12,12,12,12,13,13,13,13,
             13,14,14,14,14,14,14,15,15,15,
             15,15,16,16,16,16,16,16,17,17,
             17,17,17,18,18,18,18,18,18)

pmatch(control1, treat1)
#[1]   1   2   3   4   5   8   9  10  11  12
#     13  14  15  16  17  18  21  22  23  24
#     25  26  27  28  29  30  31  34  35  36
#     37  38  39  40  41  42  43  44  47  48
#     49  50  51  52  53  54  55  56  57  60
#     61  62  63  64  65  67  68  69  70  71
#     73  74  75  76  77  78  80  81  82  83
#     84  86  87  88  89  90  91  93  94  95
#     96  97  99 100 101 102 103 104 106 107
#    108 109 110 112 113 114 115 116 117

However, the following example does not work as I expected. The only difference between the example above and the one below is the presence of a few additional elements of value 19 at the end of the vectors below. The output below contains numerous NA's and only seems to include the position in treat2 of the first element of a given value in control2. I have tried including some of the options for pmatch in the documentation but cannot get output similar to that shown above.

There are several similar questions on Stack Overflow, such as the following, but I have not found a solution to my issue:

Properties of pmatch function

treat2   <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
              2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
              5, 5, 5, 6, 6, 6, 6, 6, 6, 7,
              7, 7, 7, 7, 7, 7, 8, 8, 8, 8,
              8, 8, 9, 9, 9, 9, 9, 9, 9,10,
             10,10,10,10,10,10,11,11,11,11,
             11,11,12,12,12,12,12,12,12,13,
             13,13,13,13,13,14,14,14,14,14,
             14,14,15,15,15,15,15,15,16,16,
             16,16,16,16,16,17,17,17,17,17,
             17,18,18,18,18,18,18,18,19,19,
             19,19,19,19,19)

control2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
              2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
              4, 4, 5, 5, 5, 5, 5, 6, 6, 6,
              6, 6, 6, 7, 7, 7, 7, 7, 8, 8,
              8, 8, 8, 8, 9, 9, 9, 9, 9,10,
             10,10,10,10,10,11,11,11,11,11,
             12,12,12,12,12,12,13,13,13,13,
             13,14,14,14,14,14,14,15,15,15,
             15,15,16,16,16,16,16,16,17,17,
             17,17,17,18,18,18,18,18,18,19,
             19,19,19,19)

pmatch(control2, treat2)
#[1]   1  NA  NA  NA  NA   8  NA  NA  NA  NA
#     NA  14  NA  NA  NA  NA  21  NA  NA  NA
#     NA  NA  27  NA  NA  NA  NA  34  NA  NA
#     NA  NA  NA  40  NA  NA  NA  NA  47  NA
#     NA  NA  NA  NA  53  NA  NA  NA  NA  60
#     NA  NA  NA  NA  NA  67  NA  NA  NA  NA
#     73  NA  NA  NA  NA  NA  80  NA  NA  NA
#     NA  86  NA  NA  NA  NA  NA  93  NA  NA
#     NA  NA  99  NA  NA  NA  NA  NA 106  NA
#     NA  NA  NA 112  NA  NA  NA  NA  NA 119
#     NA  NA  NA  NA

Have a look at [r: pmatch isn't working for big dataframe](https://stackoverflow.com/questions/36654375) . — GKi, Jul 27 '23 at 07:49

score 2 · Accepted Answer · answered Dec 05 '20 at 08:30

Given that your treat and control are always numbers, I think it might be easier (and faster) to just rewrite that function using Rcpp. Consider something like this

Rcpp::cppFunction('NumericVector cpmatch(NumericVector x, NumericVector table) {
  int n = x.size(), m = table.size();
  NumericVector out(n, NA_REAL), y = clone(table);
  for (int i = 0; i < n; i++) {
    if (ISNAN(x[i])) {
      continue;
    }
    for (int j = 0; j < m; j++) {
      if (!ISNAN(y[j]) & x[i] == y[j]) {
        y[j] = NA_REAL;
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')

Test

> cpmatch(control2, treat2)

  [1]   1   2   3   4   5   8   9  10  11  12  13  14  15  16  17  18  21  22  23  24  25  26  27  28  29  30  31  34  35  36  37  38  39  40  41  42  43
 [38]  44  47  48  49  50  51  52  53  54  55  56  57  60  61  62  63  64  65  67  68  69  70  71  73  74  75  76  77  78  80  81  82  83  84  86  87  88
 [75]  89  90  91  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109 110 112 113 114 115 116 117 119 120 121 122 123

> cpmatch(control1, treat1)

 [1]   1   2   3   4   5   8   9  10  11  12  13  14  15  16  17  18  21  22  23  24  25  26  27  28  29  30  31  34  35  36  37  38  39  40  41  42  43
[38]  44  47  48  49  50  51  52  53  54  55  56  57  60  61  62  63  64  65  67  68  69  70  71  73  74  75  76  77  78  80  81  82  83  84  86  87  88
[75]  89  90  91  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109 110 112 113 114 115 116 117

Benchmark

> microbenchmark::microbenchmark(cpmatch(control1, treat1), pmatch(control1, treat1))

Unit: microseconds
                      expr   min    lq    mean median    uq   max neval cld
 cpmatch(control1, treat1)  16.9  17.3  19.795  17.55  18.1  55.7   100  a 
  pmatch(control1, treat1) 174.5 174.8 187.174 175.20 188.5 421.9   100   b

score 0 · Answer 2 · answered Dec 05 '20 at 05:46

Perhaps there is a way to get the desired output from pmatch, but I have not been able to figure out how. I tried looking at the source code for the pmatch function here:

R-4.0.3\src\library\base\R\match.R

But was not able to make progress that way.

So, I wrote the following for-loop to apply to the output of pmatch and replace the NA's with the elements I wanted. It seems to work, at least for the example below.

my.vector <- c(1,  NA, NA, NA, NA,  8, NA, NA, NA, NA,
               NA, 14, NA, NA, NA, NA, 21, NA, NA, NA,
               NA, NA, 27, NA, NA, NA, NA, 34, NA, NA, NA, NA, NA)

desired.result <- c(1,   2,  3,  4,  5,  8,  9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 21, 22, 23, 24,
                    25, 26, 27, 28, 29, 30, 31, 34, 35, 36, 37, 38, 39)

pos.not.na <- which(!is.na(my.vector))

if(any(is.na(my.vector)) == TRUE) {

     my.output <- my.vector

     for(i in 2:length(pos.not.na)) {

          my.output[pos.not.na[(i-1)]:(pos.not.na[i]-1)] <- seq(my.vector[pos.not.na[(i-1)]], 
          (my.vector[pos.not.na[(i-1)]] + (length(pos.not.na[(i-1)]:(pos.not.na[i]-1)) - 1)))

     }

     my.output[pos.not.na[length(pos.not.na)]:length(my.vector)] <- seq(my.vector[pos.not.na[length(pos.not.na)]],
          (my.vector[pos.not.na[length(pos.not.na)]] + length(pos.not.na[length(pos.not.na)]:length(my.vector)) - 1))

}

if(any(is.na(my.vector)) == FALSE) {my.output = my.vector}

my.output

all.equal(my.output, desired.result)
#[1] TRUE

I don't think you should use `pmatch`. It converts your inputs into characters and finds the partial match for each element. For example, try this `pmatch(1, c(20,19))` and you will see it returns 2 because it finds `"1"` in `"19"`. @MarkMiller — ekoam, Dec 05 '20 at 06:33
@ekoam Thank you. That is good to know. Can you suggest a better way to obtain the desired example output in the original post? I asked a similar question a few days ago: here https://stackoverflow.com/questions/65086390/position-of-elements-from-one-vector-in-another-vector-with-r and someone suggested I use `pmatch`. — Mark Miller, Dec 05 '20 at 06:48

score 0 · Answer 3 · answered Jul 27 '23 at 06:29

A possibility might be to use match with make.unique. Also using split or using Rcpp will be possible.

pm <- function(x, y) {
    a <- split(seq_along(x), x)
    b <- split(seq_along(y), y)[names(a)]
    b[lengths(b)==0] <- NA
    b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
    `[<-`(b, unlist(a, FALSE, FALSE), b) }

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>

using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pmC(NumericVector a, NumericVector b) {
  IntegerVector idx(no_init(a.size()));
  std::unordered_map<float, std::queue<int> > lut;
  for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i);
  for(int i = 0; i < idx.size(); ++i) {
    auto search = lut.find(a[i]);
    if(search != lut.end() && search->second.size() > 0) {
      idx[i] = search->second.front() + 1;
      search->second.pop();
    } else {idx[i] = NA_INTEGER;}
  }
  return idx;
}
)")

pm(control1, treat1)
pmC(control1, treat1)
match(make.unique(as.character(control1)), make.unique(as.character(treat1)))
# [1]   1   2   3   4   5   8   9  10  11  12  13  14  15  16  17  18  21  22  23
#[20]  24  25  26  27  28  29  30  31  34  35  36  37  38  39  40  41  42  43  44
#[39]  47  48  49  50  51  52  53  54  55  56  57  60  61  62  63  64  65  67  68
#[58]  69  70  71  73  74  75  76  77  78  80  81  82  83  84  86  87  88  89  90
#[77]  91  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109 110 112 113
#[96] 114 115 116 117

pm(control2, treat2)
pmC(control2, treat2)
match(make.unique(as.character(control2)), make.unique(as.character(treat2)))
#  [1]   1   2   3   4   5   8   9  10  11  12  13  14  15  16  17  18  21  22
# [19]  23  24  25  26  27  28  29  30  31  34  35  36  37  38  39  40  41  42
# [37]  43  44  47  48  49  50  51  52  53  54  55  56  57  60  61  62  63  64
# [55]  65  67  68  69  70  71  73  74  75  76  77  78  80  81  82  83  84  86
# [73]  87  88  89  90  91  93  94  95  96  97  99 100 101 102 103 104 106 107
# [91] 108 109 110 112 113 114 115 116 117 119 120 121 122 123

Benchmark

bench::mark(cpmatch = cpmatch(control1, treat1),
            pmatch = pmatch(control1, treat1),
            pm = pm(control1, treat1),
            pmC = pmC(control1, treat1),
            match = match(make.unique(as.character(control1)), make.unique(as.character(treat1)))
            )
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 cpmatch     14.08µs   14.7µs    56785.    4.28KB     5.68  9999     1
#2 pmatch     137.06µs  137.6µs     7085.    4.55KB     0     3542     0
#3 pm         256.02µs  263.2µs     3497.   15.84KB     8.27  1692     4
#4 pmC          9.76µs   10.6µs    93517.    2.93KB     9.35  9999     1
#5 match       180.6µs  181.5µs     5115.    9.89KB     0     2558     0

Output of pmatch changes with vector length

3 Answers3