0

@Moody_Mudskipper and I were having a discussion of the relative merits of negative integer indexing and positive integer indexing. Consider a large matrix with 1 row:

m <- matrix(integer(2^20), 1)

Is a negative or positive index vector a more (time) efficient way to get all but the last column?

Our candidates so far are

neg <- function(m) m[,-ncol(m)]
pos <- function(m) m[,seq_len(ncol(m) - 1)]

How about for a vector: v <- integer(2^28)

neg_v <- function(v) v[-length(v)]
pos_v <- function(v) v[seq_len(length(v) - 1)]

If there is a benefit to one approach, does it carry over to the more general case, where you want everything but some arbitrary index, i?

We were inspired, in part, by the discussion of the most efficient way to get the last element, here

De Novo
  • 7,120
  • 1
  • 23
  • 39

2 Answers2

2

[ is very versatile. You pay for that with a small performance cost. I would still encourage you to use it for this task. If performance of this step in your algorithm is critical to you, you are using the wrong language.

Alternative for the vector:

len_v <- function(v) `length<-`(v, length(v) - 1)

For the matrix (note this doesn't drop dimensions):

dim_m <- function(m) matrix(`length<-`(m, length(m) - nrow(m)), nrow = nrow(m))

benchmarks:

[1] 10
Unit: nanoseconds
     expr  min   lq     mean median   uq     max neval cld
 neg_v(v) 3841 4098  4747.78   4353 4609   36363   100   a
 pos_v(v) 4866 5378  5572.30   5633 5634    7426   100   a
   neg(m) 4353 4866  5165.02   5122 5378    6914   100   a
   pos(m) 5121 5378  5792.41   5634 5890   15364   100   a
 len_v(v)  768 1024  1065.31   1024 1280    2049   100   a
 dim_m(m) 2817 3329 24219.64   3585 3841 2063960   100   a

<snip>

[1] 20
Unit: microseconds
     expr      min       lq      mean    median       uq        max neval cld
 neg_v(v) 3125.901 3210.406 5997.0644 4080.9305 4726.878 102998.274   100   b
 pos_v(v) 4510.752 4572.721 6308.3921 5097.8025 5980.619 104768.516   100   b
   neg(m) 3369.172 3438.952 6286.7153 4657.4825 4976.423 102606.735   100   b
   pos(m) 4499.996 4518.306 4895.5423 4534.5670 5059.904   6577.785   100  ab
 len_v(v)  562.852  591.020  859.3246  630.3275  680.006   2381.493   100  a 
 dim_m(m) 1072.184 1120.070 3528.7498 1167.1880 2377.523  99962.255   100  ab
Roland
  • 127,288
  • 10
  • 191
  • 288
  • That's quite nice! It seems obvious once you see it, but of course, wasn't obvious at all! I'm keeping it open for a while to see what else we get, but so far this looks like a winner. – De Novo Mar 27 '18 at 12:10
  • I don't know if this demonstrates a "performance cost" for the `[`operator. You haven't implemented indexing using a less flexible but more efficient method, you've approached the specific problem with a different computational strategy all together (a very clever approach). – De Novo Mar 27 '18 at 19:44
  • Interpreting the question as: what's the most time efficient way to get all but the last column, this is the answer. – De Novo Mar 28 '18 at 18:38
1

Neither solution is independent of the size of the input. The positive index vector wins as the size of the object gets larger. This might be a little surprising, since we presumed the negative integer index sent things to C immediately, and the positive integer index made one additional R call in seq_len, but that is a very fast primitive. Note, though, that we're not even talking about a factor of 2.

For the general case, though, where the positive index vector has to be built with two seq calls, even the primitives seq_len and seq.int, any time efficiency benefit of the positive index is lost.

Here's the microbenchmark data for our initial candidates, edited to include @DavidArenburg's suggestion, head(v, -1) for comparison, which works for the vector question, but not the matrix question:

for (size in 10:20){
  v <- integer(2^size)
  m <- matrix(v, 1)
  print(size)
  print(microbenchmark(neg_v(v), pos_v(v), head(v, -1), neg(m), pos(m)))
}


[1] 10
Unit: microseconds
        expr    min      lq     mean  median      uq    max neval cld
    neg_v(v)  7.665  7.9435  8.41404  8.1680  8.4995 20.977   100 a  
    pos_v(v)  7.684  7.9900  8.94586  8.2090  8.4895 57.970   100 ab 
 head(v, -1) 18.371 19.2280 20.77184 19.7155 20.6470 58.672   100   c
      neg(m)  9.013  9.3440  9.99337  9.6675 10.0070 23.603   100  b 
      pos(m)  8.457  8.9640  9.36145  9.1965  9.5035 16.494   100 ab 
[1] 11
Unit: microseconds
        expr    min      lq     mean  median      uq    max neval cld
    neg_v(v) 13.905 14.2010 15.36906 14.3640 14.6355 25.910   100 a  
    pos_v(v) 14.140 14.4275 15.83083 14.5965 14.8975 55.185   100 ab 
 head(v, -1) 25.068 25.8605 28.92266 26.6320 27.6395 79.200   100   c
      neg(m) 15.769 16.1215 17.52703 16.3175 16.7270 51.691   100  b 
      pos(m) 15.149 15.5375 16.68138 15.7605 16.0315 49.755   100 ab 
[1] 12
Unit: microseconds
        expr    min      lq     mean  median      uq     max neval cld
    neg_v(v) 26.190 27.3690 31.18507 29.4325 31.4510  48.557   100  a 
    pos_v(v) 25.296 28.4645 35.51150 29.3030 32.1255 421.324   100  a 
 head(v, -1) 37.928 39.8455 48.19410 40.9510 46.2560 375.465   100   b
      neg(m) 29.437 31.3505 35.00288 32.6420 35.0445  69.953   100  a 
      pos(m) 26.738 28.9945 37.35408 29.4220 34.1825 462.919   100  ab
[1] 13
Unit: microseconds
        expr    min      lq      mean  median      uq      max neval cld
    neg_v(v) 46.506 51.6525  66.96329 56.8475 58.5865  442.329   100   a
    pos_v(v) 48.263 56.0625  70.38045 56.4500 58.1370  498.686   100   a
 head(v, -1) 58.836 66.9540  75.79951 68.0400 71.4055  179.082   100   a
      neg(m) 51.101 57.8895 153.67522 61.9160 63.4910 8280.099   100   a
      pos(m) 49.914 55.1205  65.61441 55.7930 60.3385  395.184   100   a
[1] 14
Unit: microseconds
        expr     min       lq     mean   median       uq      max neval cld
    neg_v(v)  91.234 111.0975 201.0988 112.6285 114.5950 8577.642   100   a
    pos_v(v)  95.394 110.4745 213.2071 111.5600 113.9150 7812.403   100   a
 head(v, -1) 107.164 122.1795 146.4730 123.7635 127.4225  567.144   100   a
      neg(m) 101.615 120.1835 141.3496 121.6715 124.0385  564.711   100   a
      pos(m)  98.133 106.4320 115.3708 108.5990 109.5940  563.723   100   a
[1] 15
Unit: microseconds
        expr     min       lq     mean   median       uq       max neval cld
    neg_v(v) 183.586 572.8205 752.0629 590.8605 668.0415 14919.153   100   a
    pos_v(v) 207.188 466.6390 605.7880 503.4750 541.7855  8832.833   100   a
 head(v, -1) 210.091 512.0090 540.2439 537.2850 563.8335   766.843   100   a
      neg(m) 219.635 594.0065 689.7076 610.0625 632.1530  7904.021   100   a
      pos(m) 214.892 400.7375 523.8320 412.9320 485.0875  8831.092   100   a
[1] 16
Unit: microseconds
        expr     min        lq      mean    median        uq       max neval cld
    neg_v(v) 363.163 1024.6745 1326.5858 1046.4375 1120.8760 10668.002   100   b
    pos_v(v) 380.258  884.8680  981.2985  897.1980  960.6395  6879.475   100  a 
 head(v, -1) 388.339  916.7945  970.3009  937.1235  966.2430  1707.400   100  a 
      neg(m) 394.158 1063.4785 1207.5574 1081.2935 1279.0390  7388.591   100  ab
      pos(m) 386.093  724.6455  920.7682  733.4520  829.8680  7127.476   100  a 
[1] 17
Unit: microseconds
        expr     min       lq     mean   median       uq        max neval cld
    neg_v(v) 734.334 2011.042 2156.787 2054.732 2209.009  10029.207   100   a
    pos_v(v) 761.919 1745.547 1828.906 1772.859 1883.327   7059.740   100   a
 head(v, -1) 791.387 1786.015 1940.802 1822.699 1955.429   8123.136   100   a
      neg(m) 794.872 1968.399 3905.147 2109.351 2315.058 153287.378   100   a
      pos(m) 767.494 1421.462 1516.872 1440.947 1556.770   7150.461   100   a
[1] 18
Unit: milliseconds
        expr      min       lq     mean   median       uq       max neval cld
    neg_v(v) 1.481940 2.240029 4.140381 4.110589 4.528180 11.454875   100   b
    pos_v(v) 1.560234 1.985711 3.588765 3.484492 3.662266 10.422674   100   b
 head(v, -1) 1.588862 2.034140 3.484662 3.544228 3.805630  9.757330   100  ab
      neg(m) 1.597898 2.120536 3.583514 3.937739 4.329667  9.443408   100   b
      pos(m) 1.550572 1.891935 2.832852 2.846605 3.088712  8.821962   100  a 
[1] 19
Unit: milliseconds
        expr      min       lq     mean   median       uq       max neval cld
    neg_v(v) 3.252966 4.080474 6.994025 4.323296 6.660327 160.33926   100   a
    pos_v(v) 3.144045 3.923489 5.330351 4.050275 7.129475  12.49244   100   a
 head(v, -1) 3.233204 3.989537 5.661008 4.246967 7.526656  12.98025   100   a
      neg(m) 3.507776 4.320612 7.669867 4.652584 8.542794 157.08516   100   a
      pos(m) 3.114030 3.685220 6.276472 3.819997 4.896968 156.17061   100   a
[1] 20
Unit: milliseconds
        expr      min        lq     mean    median       uq      max neval cld
    neg_v(v) 8.089944 12.161850 18.88741 13.185043 14.51066 171.9262   100   a
    pos_v(v) 7.918274  8.419380 16.07148 11.899635 13.33752 173.2544   100   a
 head(v, -1) 7.938086  8.478848 17.49432 12.432816 13.54419 165.1829   100   a
      neg(m) 8.645179 11.077800 20.55285 13.502900 14.43028 171.8824   100   a
      pos(m) 7.334725  7.565886 14.32296  8.394255 12.04803 166.6472   100   a

general case

neg_mid <- function(m, i) m[,-i]
pos_mid <- function(m, i) m[c(seq_len(i-1), seq.int(i + 1, ncol(m)))]
neg_mid_v <- function(v, i) v[-i]
pos_mid_v <- function(v, i) v[c(seq_len(i - 1), seq.int(i + 1, length(v)))]

for (size in 10:20){
  v <- integer(2^size)
  m <- matrix(v, 1)
  i <- sample(2^size, 1)
  print(size)
  print(microbenchmark(neg_mid(m, i), pos_mid(m, i), neg_mid_v(v, i), pos_mid_v(v, i)))
}

[1] 10
Unit: microseconds
            expr    min      lq     mean  median      uq      max neval cld
   neg_mid(m, i)  8.256  8.6450 23.24169  8.9035  9.2690 1398.011   100   a
   pos_mid(m, i) 11.559 11.8745 42.67081 12.3090 12.5665 2985.724   100   a
 neg_mid_v(v, i)  7.262  7.6350 25.55930  7.9745  8.3010 1706.966   100   a
 pos_mid_v(v, i) 11.162 11.6665 43.02460 11.9085 12.1270 3076.268   100   a

<snip>

[1] 20
Unit: milliseconds
            expr       min        lq     mean   median       uq      max neval cld
   neg_mid(m, i)  8.399471  9.239935 17.68447 13.61378 14.67457 183.7487   100   a
   pos_mid(m, i) 11.217084 12.215356 19.00080 16.47771 17.76067 186.3762   100   a
 neg_mid_v(v, i)  7.832412 10.067575 15.94740 13.14343 14.61393 187.9892   100   a
 pos_mid_v(v, i) 10.890024 13.340772 20.99761 16.47525 18.06857 195.5897   100   a
  • an aside: in the additional special case, where you're looking for all but the first row, the positive index vector still wins out over the negative for time efficiency as the input becomes large.
De Novo
  • 7,120
  • 1
  • 23
  • 39