
I have a JxK dataframe M and I want to calculate the following.

  1. For each row j, the value k that minimizes M[j,k]
  2. For each column k, the value j that minimizes M[j,k]

Let the values satisfying the first be the vector A_j and those satisfying the second be the vector A_k, and let C be the vector sort(c(A_j, A_k)). I then need two further vectors:

  3. A vector of the same length as A_j, where element i is the index of A_j[i] in the combined and sorted vector C.
  4. A vector of the same length as A_k, where element i is the index of A_k[i] in the combined and sorted vector C.

For both of the index vectors described above, tied values should all be given the first index at which that value appears in C. That is, if A_j[i] and A_j[i+1] are equal, then elements i and i+1 of the vector that satisfies condition #3 should both equal the position of the first occurrence of that value in the sorted vector C.
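To make the tie rule concrete, here is a tiny sketch with made-up numbers. The values of A_j and A_k below are hypothetical and chosen only so that a value repeats; the lookup uses the same element-by-element which() approach as the proof of concept further down.

```r
# Hypothetical minima, chosen so that values repeat
A_j <- c(2, 5, 2)   # row-wise results (made up for illustration)
A_k <- c(5, 1)      # column-wise results (made up for illustration)

# Combined and sorted vector C
C <- sort(c(A_j, A_k))   # 1 2 2 5 5

# Position of the first occurrence of each value in C;
# tied values all receive the same (first) index
vec3 <- sapply(A_j, function(x) which(C == x)[1])   # 2 4 2
vec4 <- sapply(A_k, function(x) which(C == x)[1])   # 4 1
```

Both 2s in A_j map to position 2, and every 5 maps to position 4, regardless of which vector it came from.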

As always, this is not hard to do inefficiently. However, in practice, the dataframe is very big, so inefficient solutions fail.

As a proof of concept, one solution would be as follows.

# Create the dataframe
set.seed(1)
df <- data.frame(matrix(rnorm(50, 8, 2), 10)) # A 10x5 matrix

# Calculate 1 and 2
A.j <- apply(df, 1, min) 
A.k <- apply(df, 2, min)

# Calculate 3 and 4
C <- sort(unname(c(A.j, A.k)))

A.j.indices <- apply(df, 1, function(x) which(x == min(x)))
A.k.indices <- apply(df, 2, function(x) which(x == min(x)))

vec3out <- c()
vec4out <- c()

for(j in 1:nrow(df)){
   rank <- which(C == A.j[j])[1] 
   vec3out <- c(vec3out, rank)
}

for(k in 1:ncol(df)){
   rank <- which(C == A.k[k])[1] 
   vec4out <- c(vec4out, rank)
}
  • Please post a minimum example that demonstrates one of these "inefficient approaches" -- it's important for verifying more efficient solutions work properly and for benchmarking runtimes. – josliber Mar 25 '14 at 22:08
  • have you been using `which.min` and `which.max`? seems straight-forward and efficient – rawr Mar 25 '14 at 22:18
  • @politicaleconomist `match` from the posted solution definitely looks like what you're looking for. Your solution posted is in the second circle of the [R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). Check it out -- it's a good read! – josliber Mar 25 '14 at 22:48

1 Answer


For starters, you should use a matrix; data frames are less efficient for this kind of numeric work (see the question "Should I use a data.frame or a matrix?"). Then use the apply family of functions together with match.

Let M be your data.frame coerced to a matrix.

M <- as.matrix(M)

minByRow <- apply(M, MARGIN=1, FUN=which.min)
minByCol <- apply(M, MARGIN=2, FUN=which.min)

combinedSorted <- sort(c(minByRow, minByCol))

byRowOutput <- match(minByRow, combinedSorted)
byColOutput <- match(minByCol, combinedSorted)
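
As a sanity check, the same match idea can be applied to the minimum values used in the question's proof of concept (which takes min rather than which.min) and compared against an element-by-element lookup. This is only a sketch on the question's toy data; the names loopRow, loopCol, fastRow and fastCol are introduced here for illustration.

```r
# Toy data from the question, kept as a matrix
set.seed(1)
M <- matrix(rnorm(50, 8, 2), 10)   # 10 x 5 matrix

# Row-wise and column-wise minimum values, as in the question
A.j <- apply(M, 1, min)
A.k <- apply(M, 2, min)
C <- sort(c(A.j, A.k))

# Reference: first position of each value in C, one element at a time
loopRow <- sapply(A.j, function(x) which(C == x)[1])
loopCol <- sapply(A.k, function(x) which(C == x)[1])

# Vectorized: match() also returns the first matching position
fastRow <- match(A.j, C)
fastCol <- match(A.k, C)

# Stops with an error if the two approaches disagree
stopifnot(identical(unname(loopRow), fastRow),
          identical(unname(loopCol), fastCol))
```

The global minimum of M is both a row minimum and a column minimum, so C always contains at least one tie, which makes this a reasonable test of the first-index rule.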

Here are the results for 1 million observations of 100 variables:

M <- matrix(data=rnorm(100000000), nrow=1000000, ncol=100)


system.time({
  minByRow <- apply(M, MARGIN=1, FUN=which.min)
  minByCol <- apply(M, MARGIN=2, FUN=which.min)

  combinedSorted <- sort(c(minByRow, minByCol))

  byRowOutput <- match(minByRow, combinedSorted)
  byColOutput <- match(minByCol, combinedSorted)
})

   user  system elapsed 
   7.37    0.46    7.93 
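
If the two apply calls turn out to dominate the runtime, one possible alternative is base R's max.col applied to the negated matrix. This is a sketch only: it was not part of the timing above and is not benchmarked here. Note also that max.col decides ties with a relative tolerance, so it may disagree with which.min when entries in a row are extremely close; the small matrix below is made up with well-separated values to avoid that.

```r
# Small matrix with well-separated values (made up for illustration)
M <- matrix(c(3, 1, 2,
              5, 4, 6,
              9, 8, 7), nrow = 3, byrow = TRUE)

# Row-wise index of the minimum: negate so the minimum becomes the maximum
minByRowAlt <- max.col(-M, ties.method = "first")

# Column-wise: apply the same idea to the transpose
minByColAlt <- max.col(-t(M), ties.method = "first")

# Compare against the apply()/which.min version
stopifnot(all(minByRowAlt == apply(M, 1, which.min)),
          all(minByColAlt == apply(M, 2, which.min)))
```

The transpose makes a full copy of the matrix, which may matter at the 1 million by 100 scale used above.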