2

I am trying to revert the indexing of a matrix in R. The following example illustrates my problem:

#sample data:

set.seed(21)
m <- matrix(sample(100,size = 100),10,10)

# sorting:

t(apply(m,1,order))

# new exemplary order after sorting:

       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    3    7   10    6    5    9    2    4    1     8
 [2,]    1    6    4    7    3    9    5    8    2    10
 [3,]    2    5    8   10    4    7    9    1    3     6
 [4,]    8    1    9    2    7    3    4    6   10     5
 [5,]    6    9    5    2    7    3   10    4    8     1
 [6,]    2    7    4    8    6    9    3   10    1     5
 [7,]    1    6    4   10    3    2    7    8    9     5
 [8,]    1    2    6    9    3   10    5    7    4     8
 [9,]    9    4    5    7   10    2    8    3    1     6
[10,]    6    8    4    3    2    1    5   10    7     9

# we can create m2 with the above sorting. We also add 1000 to all values
m2 <- t(apply(m,1,function(x){
x[order(x)] 
})) + 1000

# the next step would be to obtain the original arrangement of columns again, as described below.

After the sorting of my data we have the following situation: In row 1, the 3rd column (of matrix m2) is mapped to the original first column (of matrix m), the 7th column is mapped to the original second column, the 10th column to the original 3rd column, and so on.

My question is as follows: Can I somehow revert this mapping in R? What I mean by this is again for row 1, move the 1st column (of m2) to the position of the 3rd column (of m), then move the 2nd column to the position of the 7th, move the 3rd to the position of the 10th, and so on.

In the end what I try to achieve is to sort my data but save the existing arrangement of the columns somehow, so later, that means after some transformations of my data, I can rearrange them to the original ordering again. When I use the usual sorting algortihms in R, I am losing the old positioning of my columns. Of course most of the time you would not need those anymore, but atm I do need them.

user3032689
  • 627
  • 1
  • 10
  • 23
  • 2
    To make this reproducible, you need to set the seed (`set.seed()`), and recopy the output. Also, where is your code for the sorting you did? Etc. – gung - Reinstate Monica May 26 '16 at 15:57
  • @gung added both, ty. – user3032689 May 26 '16 at 16:01
  • yes because I used set.seed afterwards. I have updated everything now. – user3032689 May 26 '16 at 16:11
  • This still doesn't make much conceptual sense to me. It may help you to read: [Understanding the order() function](http://stackoverflow.com/q/2315601/1217536). – gung - Reinstate Monica May 26 '16 at 16:15
  • @gung ok well it's not so obvious. What I do is some kind of transformation (calculations etc) on the ordered data. Afterwards, I would like to set up the original order again, that is the one before the ordering. It is not possible to do this easily afaik. – user3032689 May 26 '16 at 17:29
  • your question is a bit vague. Do you want to apply changes (i.e. move the 1st column to the position of the 3rd column, then move the 2nd column to the position of the 7th, move the 3rd to the position of the 10th, and so on) to the original `m` matrix or a new one i.e. `t(apply(m,1,order))` – 989 May 26 '16 at 17:37
  • You are not working with the ordered data here. Did you compare the output of `t(apply(m,1,order))` to the original data `m`? You need to be clearer on what `order()` does. The linked thread will help you. – gung - Reinstate Monica May 26 '16 at 18:23
  • I edited the thread: It's not the original m, it's the ordered m + a random transformation (I added 1000 to all values as an example). I would like to have the original sorting of the columns in m with the new data of m2. – user3032689 May 26 '16 at 18:25
  • You *are not* working w/ `m`, nor are you working with "the ordered m" or "the ordered m + a random transformation". You need to be clear on what `order()` does. – gung - Reinstate Monica May 26 '16 at 19:26

2 Answers2

4

Background

I think it will help to examine the effect of the order() and rank() functions on a simple vector. Consider:

x <- c('c','b','d','b','a');
seq_along(x);
## [1] 1 2 3 4 5
order(x);
## [1] 5 2 4 1 3
rank(x); ## default is ties.method='average'
## [1] 4.0 2.5 5.0 2.5 1.0
rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(x,ties.method='last'); ## available from 3.3.0
## [1] 4 3 5 2 1
rank(x,ties.method='random'); ## we can ignore this one, obviously
## [1] 4 2 5 3 1
rank(x,ties.method='max');
## [1] 4 3 5 3 1
rank(x,ties.method='min');
## [1] 4 2 5 2 1

(I used character values to demonstrate that these principles and algorithms can apply to any (comparable) data type, not just numeric types. But obviously this includes numeric types.)

The order() function returns a vector that is the same length as the input vector. The order values represent a reordering of the input indexes (which are shown above courtesy of seq_along()) in such a way that when the input vector is indexed with the order vector, it will be sorted (according to the chosen sort method, which (if not explicitly overridden by a method argument), is radixsort for integer, logical, and factor, shellsort otherwise, and takes into account the collation order of the current locale for character values when not using radixsort). In other words, for an element of the result vector, its value gives the input index of the element in the input vector that should be moved to that position in order to sort it.

To try to put it even more plainly, an element of the order vector basically says "place the input vector element with this index in my position". Or, in a slightly more generic way (which will dovetail with the parallel description of rank()):

order element: the input vector element with this index sorts into my position.

In a sense, rank() does the inverse of what order() does. Its elements correspond to the elements of the input vector by index, and its values give a representation of the sort order of the corresponding input element (with tiebreaking behavior depending on the ties.method argument; this contrasts with order(), which always preserves the incoming order of ties, equivalent to ties.method='first' for rank()).

To use the same language structure that I just used for order(), which is the plainest manner of expression I can think of:

rank element: the input vector element in my position sorts into this index.

Of course, this description is only perfectly accurate for ties.method='first'. For the others, the destination index for ties will actually be the reverse of the incoming order (for 'last'), the lowest index of the duplicate set (for 'min'), the highest (for 'max'), the average (for 'average', which is actually the default), or random (for 'random'). But for our purposes, since we need to mirror the proper sort order as per order() (and therefore sort(), which uses order() internally), let's ignore the other cases from this point forward.


I've thought of one final way to articulate the behaviors of the order() and rank() functions: order() defines how to pull elements of the input vector into a sorted order, while rank() defines how to push elements of the input vector into a sorted order.

This is why indexing the input vector with the results of order() is the correct way to sort it. Indexing a vector is inherently a pulling operation. Each respective index vector element effectively pulls the input vector element that is stored at the index given by that index vector element into the position occupied by that index vector element in the index vector.

Of course, the "push vector" produced by rank() cannot be used in the same way as the "pull vector" produced by order() to directly sort the input vector, since indexing is a pull operation. But we can ask, is it in any way possible to use the push vector to sort the input vector? Yes, I've thought of how this can be done. The solution is index-assigning, which is inherently a push operation. Specifically, we can index the input vector with the push vector as the (lvalue) LHS and assign the input vector itself as the RHS.

So, here are the three methods you can use to sort a vector:

x[order(x)];
[1] "a" "b" "b" "c" "d"
sort(x); ## uses order() internally
[1] "a" "b" "b" "c" "d"
y <- x; y[rank(y,ties.method='first')] <- y; y; ## (copied to protect x, but not necessary)
[1] "a" "b" "b" "c" "d"

An interesting property of the rank() function with ties.method='first' is that it is idempotent. This is because, once you've produced a rank vector, ranking it again will not change the result. Think about it: say the first element ranks 4th. Then the first call will produce a 4 in that position. Running rank() again will again find that it ranks 4th. You don't even need to specify ties.method anymore for the subsequent calls to rank, because the values will have become distinct on the first call's (potential) tiebreaking.

rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(rank(x,ties.method='first'));
## [1] 4 2 5 3 1
rank(rank(rank(x,ties.method='first')));
## [1] 4 2 5 3 1
y <- rank(x,ties.method='first'); for (i in seq_len(1e3L)) y <- rank(y); y;
## [1] 4 2 5 3 1

On the other hand, order() is not idempotent. Repeatedly calling order() has the interesting effect of alternating between the push and pull vectors.

order(x);
## [1] 5 2 4 1 3
order(order(x));
## [1] 4 2 5 3 1
order(order(order(x)));
## [1] 5 2 4 1 3

Think about it: if the last element sorts 1st, then the first call to order() will pull it into the 1st position by placing its index (which is largest of all indexes) into the 1st position. The second call to order() will identify that the element in the 1st position is largest in the entire vector, and thus will pull index 1 into the last position, which is equivalent to ranking the last element with its rank of 1.


Solutions

Based on all of the above, we can devise 3 solutions to your problem of "desorting", if you will.

For input, let's assume that we have (1) the input vector x, (2) its sort order o, and (3) the sorted and possibly transformed vector xs. For output we need to produce the same vector xs but desorted according to o.

Common input:

x <- c('c','b','d','b','a'); ## input vector
o <- order(x); ## order vector
xs <- x[o]; ## sorted vector
xs <- paste0(xs,seq_along(xs)); ## somewhat arbitrary transformation
x;
## [1] "c" "b" "d" "b" "a"
o;
## [1] 5 2 4 1 3
xs;
## [1] "a1" "b2" "b3" "c4" "d5"

Method 1: pull rank()

Since the order and rank vectors are effectively inverses of each other (i.e. pull and push vectors), one solution is to compute the rank vector in addition to the order vector o, and use it to desort xs.

xs[rank(x,ties.method='first')];
## [1] "c4" "b2" "d5" "b3" "a1"

Method 2: pull repeated order()

Alternatively, instead of computing rank(), we can simply use a repeated order() call on o to generate the same push vector, and use it as above.

xs[order(o)];
## [1] "c4" "b2" "d5" "b3" "a1"

Method 3: push order()

I was thinking to myself that, since we already have the order vector o, we really shouldn't have to go to the trouble of computing another order or rank vector. Eventually I realized that the best solution is to use the pull vector o as a push vector. This accomplishes the desorting objective with the least work.

xs[o] <- xs;
xs;
## [1] "c4" "b2" "d5" "b3" "a1"

Benchmarking

library(microbenchmark);

desort.rank <- function(x,o,xs) xs[rank(x,ties.method='first')];
desort.2order <- function(x,o,xs) xs[order(o)];
desort.assign <- function(x,o,xs) { xs[o] <- xs; xs; };

## simple test case
x <- c('c','b','d','b','a');
o <- order(x);
xs <- x[o];
xs <- paste0(xs,seq_along(xs));

ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE

microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: microseconds
##                     expr     min      lq      mean  median      uq     max neval
##    desort.rank(x, o, xs) 106.487 122.523 132.15393 129.366 139.843 253.171   100
##  desort.2order(x, o, xs)   9.837  12.403  15.66990  13.686  16.251  76.122   100
##  desort.assign(x, o, xs)   1.711   2.567   3.99916   3.421   4.277  17.535   100

## scale test case
set.seed(1L);
NN <- 1e4; NE <- 1e5; x <- sample(seq_len(NN),NE,T);
o <- order(x);
xs <- x[o];
xs <- xs+seq(0L,NE-1L)/NE;

ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE

microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: milliseconds
##                     expr       min        lq     mean    median        uq       max neval
##    desort.rank(x, o, xs) 36.488185 37.486967 39.89157 38.613191 39.145405 85.849143   100
##  desort.2order(x, o, xs) 16.764414 17.262630 18.10341 17.443527 19.014296 28.338835   100
##  desort.assign(x, o, xs)  1.457014  1.498495  1.82893  1.527363  1.592151  4.255573   100

So, clearly the index-assignment solution is the best.


Demo

Below is a demonstration of how this solution can be used for your sample input.

I honestly think that a simple for-loop over the rows is preferable to an apply() call in this case, since you can modify the matrix in-place. If you need to preserve the sorted intermediate matrix, you can copy it before applying this desorting operation.

## generate input matrix
set.seed(21L); m <- matrix(sample(seq_len(100L)),10L); m;
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   79   61    1   66   40   39    2   86   44    26
##  [2,]   25   84   49   35   67   32   36   70   50   100
##  [3,]   69    6   90   51   30   92   65   34   68    42
##  [4,]   18   54   72   73   85   75   55   15   27    77
##  [5,]   93   16   23   58    9    7   19   64    8    46
##  [6,]   88    4   60   13   98   47    5   29   56    80
##  [7,]   10   45   43   14   95   11   74   76   83    38
##  [8,]   17   24   57   82   63   28   71   87   53    59
##  [9,]   91   41   81   21   22   94   33   62   12    37
## [10,]   78   52   48   31   89    3   97   20   99    96

## sort each row, capturing sort order in rowwise order matrix
o <- matrix(NA_integer_,nrow(m),ncol(m)); ## preallocate
for (ri in seq_len(nrow(m))) m[ri,] <- m[ri,o[ri,] <- order(m[ri,],decreasing=T)];

## whole-matrix transformation
## embed row index as tenth digit, column index as hundredth (arbitrary)
m <- m+(row(m)-1L)/nrow(m)+(col(m)-1L)/ncol(m)/10;

## desort
for (ri in seq_len(nrow(m))) m[ri,o[ri,]] <- m[ri,]; m;
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
##  [1,] 79.01 61.03  1.09 66.02 40.05 39.06  2.08 86.00 44.04  26.07
##  [2,] 25.19 84.11 49.15 35.17 67.13 32.18 36.16 70.12 50.14 100.10
##  [3,] 69.22  6.29 90.21 51.25 30.28 92.20 65.24 34.27 68.23  42.26
##  [4,] 18.38 54.36 72.34 73.33 85.30 75.32 55.35 15.39 27.37  77.31
##  [5,] 93.40 16.46 23.44 58.42  9.47  7.49 19.45 64.41  8.48  46.43
##  [6,] 88.51  4.59 60.53 13.57 98.50 47.55  5.58 29.56 56.54  80.52
##  [7,] 10.69 45.64 43.65 14.67 95.60 11.68 74.63 76.62 83.61  38.66
##  [8,] 17.79 24.78 57.75 82.71 63.73 28.77 71.72 87.70 53.76  59.74
##  [9,] 91.81 41.84 81.82 21.88 22.87 94.80 33.86 62.83 12.89  37.85
## [10,] 78.94 52.95 48.96 31.97 89.93  3.99 97.91 20.98 99.90  96.92
bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • thanks a lot for the detailed answer with solutions and benchmarking! – user3032689 May 27 '16 at 10:57
  • based on the discussion below on ties, do method 2 and 3 always work or only in my particular example? Because I acutally do transformations which are order dependent. I only wanted to keep it as simple as possible here. – user3032689 May 27 '16 at 12:07
  • 1
    Yes, all 3 solutions work on ties. The simple test case with `x <- c('c','b','d','b','a')` has a tie, and you can see all 3 solutions give the same result. – bgoldst May 27 '16 at 12:20
  • Yes I just understood that. Great! – user3032689 May 27 '16 at 12:24
  • 1
    Very impressive! Super-agree with the statement about using a for loop rather than apply in this case. (Though, note to OP, if performance *really* matters transpose your matrix in advance and operate on columns instead of rows). I'm surprised rank is that much slower than order - on large numeric data looks like `rank(x, tie = 'first')` is a bit slower than `order(order(x))`. – Gregor Thomas May 27 '16 at 16:50
2

rank is the complement to order(). You need to save the original rank() and you can use that to get back to the original ordering after rearranging with order().

I think your example is overcomplicated (far from minimal!) by putting things in a matrix and doing extra stuff. Because you are applying functions at the row-level you just need to solve it for a vector. An example:

set.seed(47)
x = rnorm(10)
xo = order(x)
xr = rank(x)
x[xo][xr] == x
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

In your case, you can perform whatever transformations you want on the ordered vector x[xo], then index the result by [xr] to get back to the original ordering.

sorted_result = x[xo] + c(1, diff(x[xo])) # some order-dependent transformation
final_result = sorted_result[xr] # back to original ordering 

If there's a possibility of ties, you'll want to use ties.method = 'first' in the rank() call.

Taking this back to the matrix example:

m3 = t(apply(m, 1, function(x) {
    xo = order(x)
    xr = rank(x, ties.method = 'first')
    (x[xo] + 1000)[xr] # add 1000 to sorted matrix and then "unsort"
}))

# check that it worked
all(m3 == (m + 1000))
# [1] TRUE
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Why do you need `ties.method='first'`? Based on my thinking, the default behavior of `ties.method='average'` will produce an average rank value for each of the duplicates within the set of duplicates, which, when used to index the sorted vector, will truncate to an integer which will still serve as a valid index to one of the elements in the duplicate set. I think it will work without `ties.method='first'`. – bgoldst May 26 '16 at 19:10
  • 1
    @bgoldst On the trivial transformation of `+1000` you're correct - but this a case where the ordering and reordering doesn't matter anyway. If the transformation depends on the order (like my `z + c(1, diff(z)` example) then equal input values in different positions will get mapped to different results. The `'average'` tie method would copy the first result for every occurrence; the `'first'` tie method will fill results in the order they occurred. Perhaps you could find a use-case for either, but the `'first'` method seems like the behavior I would expect by default. – Gregor Thomas May 26 '16 at 19:38
  • Ah, I see. Good thinking @Gregor. – bgoldst May 26 '16 at 19:50
  • @Gregor Thank you as well for your answer! – user3032689 May 27 '16 at 10:57