1

I would like to transform multiple binary columns into one column with categorical content based on the rownames of the binary columns. I have tried to use the code in this forum in other questions but as my case differs in that some rows have no entry, I have not managed to find an efficient solution. The dplyr solution proposed there is slow on my dataset. My dataset has over 2 Mio rows.

Here is the data

m <- matrix(0,10,10)
colnames(m) <- c("a","b","c","d","e","f","g","h","i","j")
m[3,2] <- 1
m[4,8] <- 1
m[5,8] <- 1
m[6,1] <- 1

looking like this

      a b c d e f g h i j
 [1,] 0 0 0 0 0 0 0 0 0 0
 [2,] 0 0 0 0 0 0 0 0 0 0
 [3,] 0 1 0 0 0 0 0 0 0 0
 [4,] 0 0 0 0 0 0 0 1 0 0
 [5,] 0 0 0 0 0 0 0 1 0 0
 [6,] 1 0 0 0 0 0 0 0 0 0
 [7,] 0 0 0 1 0 0 0 0 0 0
 [8,] 0 0 0 0 0 0 0 0 0 0
 [9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0

I would like to get

      colname 
 [1,] "" 
 [2,] "" 
 [3,] "b"
 [4,] "h" 
 [5,] "h" 
 [6,] "a" 
 [7,] "d" 
 [8,] "" 
 [9,] "" 
[10,] "" 
S Front
  • 333
  • 1
  • 8

2 Answers2

5

This should be quick (borrowing a hint from Ronak and using ties.method = "first")

mc = max.col(m, ties.method = "first")
result = ifelse(m[cbind(1:nrow(m), mc)] == 0, "", colnames(m)[mc])
result
# [1] ""  ""  "b" "h" "h" "a" ""  ""  ""  "" 

In general, matrices are much faster than data frames. Converting your matrix to a data frame to use dplyr will be slow compared to all-matrix operations.

Here's another possibility:

nm = colnames(m)
apply(m, 1, function(x) if (any(x == 1)) nm[which.max(x)] else "")

The max.col solutions are very quick, especially Ronak's, taking my laptop less than 5 seconds on a 2M x 325 matrix:

## Generate data
nm = combn(letters, 2, FUN = paste, collapse = "")
nr = 2e6
nc = length(nm)

m = matrix(0, nrow = nr, ncol = nc)
n_ones = 1.5e6
ones = cbind(sample(1:nr, size = n_ones), sample(1:nc, size = n_ones, replace = TRUE))
m[ones] = 1

system.time(
  {r1 = apply(m, 1, function(x) if (any(x == 1)) nm[which.max(x)] else "")}
)
#  user  system elapsed 
# 13.94    3.87   19.50 

system.time({
  mc = max.col(m, ties.method = "first")
  r2 = ifelse(m[cbind(1:nrow(m), mc)] == 0, "", nm[mc])
})
# user  system elapsed 
# 3.56    0.01    3.72 

# Ronak's solution
system.time({
  cols <- max.col(m, ties.method = "first")
  vec <- colnames(m)[cols]
  vec[cols == 1 & m[, 1] != 1] <- ''
})
# user  system elapsed 
# 3.16    0.00    3.31 

The max.col solutions are much faster than apply, and Ronak's optimizations make it quite good.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
5

Another way using max.col could be to get index of max element in each row specifying ties.method = "first" and get the corresponding column names. We can then replace the columns names to blank which return max.col value as 1 (1st column) but does not have 1 in the first column of m.

cols <- max.col(m, ties.method = "first")
vec <- colnames(m)[cols]
vec[cols == 1 & m[, 1] != 1] <- ''
#[1] ""  ""  "b" "h" "h" "a" ""  ""  ""  "" 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    Nice, only looking at the first column is probably a good bit faster than my more complex indexing. And avoiding `ifelse` is also good. – Gregor Thomas Apr 03 '20 at 01:08
  • Well, I'm actually shocked - the big speed-up is due to `ties.method = "first"`, that makes the expensive `max.col` operation about 5x faster. – Gregor Thomas Apr 03 '20 at 01:11
  • 1
    Maybe `ties.method = "first"` must be skipping few calculations which makes it faster. – Ronak Shah Apr 03 '20 at 01:13