An efficient way to indicate multiple indicator variables per row?

Question

Given an "empty" indicator dataframe:

Index    Ind_A    Ind_B
  1        0        0
  2        0        0
  3        0        0
  4        0        0

and a dataframe of values:

Index    Indicators
  1         Ind_A
  3         Ind_A
  3         Ind_B
  4         Ind_A

I want to end up with:

Index    Ind_A    Ind_B
  1        1        0
  2        0        0
  3        1        1
  4        1        0

Is there a way to do this without a for loop?

You might consider making a new question out of the situation described in your edit. — Frank, May 13 '15 at 19:29
@Frank Thanks I did http://stackoverflow.com/questions/30243920/an-efficient-way-to-indicate-multiple-indicator-variables-per-row-with-composite — lapolonio, May 14 '15 at 17:46

bgoldst · Answer 1 · 2015-05-14T01:45:01.917

indicator <- data.frame(Index=1:4,Ind_A=rep(0,4),Ind_B=rep(0,4));
values <- data.frame(Index=c(1,3,3,4),Indicators=c('Ind_A','Ind_A','Ind_B','Ind_A'));
indicator[cbind(match(values$Index,indicator$Index),match(values$Indicators,names(indicator)))] <- 1;
indicator;
##   Index Ind_A Ind_B
## 1     1     1     0
## 2     2     0     0
## 3     3     1     1
## 4     4     1     0

The most significant change in your edit is that indicator$Index now does not contain unique values (at least not on its own), so a simple match() from values$Index to indicator$Index is insufficient. Instead, we actually must run an outer() equality test on both Index and Index2 to get a matrix of logicals representing which rows in indicator each values row matches on both keys. Assuming the two-column composite key is unique, we can then calculate the row index in indicator from the linear (vector) index returned by which().

indicator[cbind((which(outer(values$Index,indicator$Index,`==`)&outer(values$Index2,indicator$Index2,`==`))-1)%/%nrow(values)+1,match(values$Indicators,names(indicator)))] <- 1;
indicator;
##   Index Index2 Ind_A Ind_B
## 1     1     10     1     1
## 2     1     11     1     0
## 3     2     10     0     1
## 4     2     12     1     0
## 5     3     10     1     0
## 6     3     12     1     0
## 7     4     10     1     1
## 8     4     12     1     0

Here's another solution using merge():

indicator[cbind(merge(values,cbind(indicator,row=1:nrow(indicator)))$row,match(values$Indicators,names(indicator)))] <- 1;
indicator;
##   Index Index2 Ind_A Ind_B
## 1     1     10     1     1
## 2     1     11     1     0
## 3     2     10     0     1
## 4     2     12     1     0
## 5     3     10     1     0
## 6     3     12     1     0
## 7     4     10     1     1
## 8     4     12     1     0

Performance

The first solution is more performant:

first <- function() indicator[cbind((which(outer(values$Index,indicator$Index,`==`)&outer(values$Index2,indicator$Index2,`==`))-1)%/%nrow(values)+1,match(values$Indicators,names(indicator)))] <<- 1;
second <- function() indicator[cbind(merge(values,cbind(indicator,row=1:nrow(indicator)))$row,match(values$Indicators,names(indicator)))] <<- 1;
N <- 10000;
system.time({ replicate(N,first()); });
##    user  system elapsed
##   2.032   0.000   2.041
system.time({ replicate(N,first()); });
##    user  system elapsed
##   2.047   0.000   2.038
system.time({ replicate(N,second()); });
##    user  system elapsed
##  12.578   0.000  12.592
system.time({ replicate(N,second()); });
##    user  system elapsed
##   12.64    0.00   12.66

can you look at my edit and answer that one as well? i tried using which with multiple conditions but that didn't help.... — lapolonio, May 13 '15 at 17:56

Frank · Answer 2 · 2015-05-13T16:27:55.433

1

I'd use matrices:

ind_mat <- as.matrix(ind_df[,-1]); rownames(ind_mat) <- ind_df[,1]
val_mat <- cbind(match(val_df$Index,ind_df[,1]),match(val_df$Indicators,names(ind_df[-1])))

ind_mat[val_mat] <- 1L
#   Ind_A Ind_B
# 1     1     0
# 2     0     0
# 3     1     1
# 4     1     0

You probably don't need "Index" as a column and can just put those as rownames. If (i) your value matrix is small relative to index matrix and (ii) your index column equals 1:nrow(ind_df), you should consider storing in a sparse matrix.

Regarding coercion to a matrix, it takes very little time and will avoid the hassle of having to coerce later for any matrix operations. Here's an example:

n    = 1e4
nind = 1e3
y    <- rnorm(n)
x    <- matrix(sample(0:1,size=n*nind,replace=TRUE),ncol=nind)
xd   <- data.frame(1:nrow(x),x)

# timing: 0.04 seconds on my computer
system.time(as.matrix(xd[,-1]))

# messiness, e.g., for OLS y~0+x: immense
solve(t(as.matrix(xd[,-1]))%*%as.matrix(xd[,-1]))%*%(t(as.matrix(xd[,-1]))%*%y)

The last line lets you avoid keeping a matrix around; I don't see the point.

edited May 13 '15 at 16:27

answered May 13 '15 at 15:30

Frank

66,179
8
96
180

2

I don't think there is much difference between the answers. Also, `as.matrix` could have an overhead. The biggest problem here is that you are not matching to the `Indx` column in the first data set, which doesn't have to be 1:4 necessarily. – David Arenburg May 13 '15 at 15:33
@DavidArenburg Okay, now it matches the index column. There's a pretty large difference between using a matrix and using a data.frame for matrix computations, anyway. Avoiding coercion and that. – Frank May 13 '15 at 15:39
Um, downvoter: I have corrected the "not matching" thing David mentioned. – Frank May 13 '15 at 15:39
Hmm. Ok, I'll upvote this, but I'd like to see some benchmarks that will prove what you saying (which will include the `as.matrix` overhead). – David Arenburg May 13 '15 at 15:41
You'll be coercing to a matrix later anyway if you do any matrix operations, like `t(as.matrix(ind_df[-1]))%*%as.matrix(ind_df[-1])` or similar. R's OLS implementation, in constructing a model matrix presumably also has to effectively coerce to a matrix or cbind the columns. – Frank May 13 '15 at 15:46
@DavidArenburg Best I can tell, `as.matrix` has very little overhead, but if you do it multiple times because you don't like having a matrix around, well, .... Anyway, for the case at hand, with Index=1:nrow(df) and all values being 0/1, the natural storage format is a matrix; anyone saying a data.frame is better should have to defend that proposition, not the other way around. And both a df and a matrix have memory overhead relative to a sparse matrix... – Frank May 13 '15 at 16:25
can you look at my edit and answer that one as well? i tried using which with multiple conditions but that didn't help.... – lapolonio May 13 '15 at 18:16
@blazinazin215 The analogue to my answer in that case would be an `array` with three dimensions. (Matrices are the special case of arrays with two dimensions.) I'm not super-confident that the approach in my answer is the best option for the case you're really facing. If your indices depart from being `1..n`, going to an array will just be wasteful (if the index is expanded from `10..12`, as in the example, to `1..12`) or confusing (if the index stays the same, with only three values, but those values are encoded only in `dimnames`). – Frank May 13 '15 at 18:30

Colonel Beauvel · Answer 3 · 2015-05-13T16:04:37.010

1

I would do directly:

df = transform(df, Index=factor(Index, level=min(Index):max(Index)))
as.data.frame.matrix(table(df))
#  Ind_A Ind_B
#1     1     0
#2     0     0
#3     1     1
#4     1     0

Data:

df = structure(list(Index = c(1, 3, 3, 4), Indicators = c("Ind_A", 
"Ind_A", "Ind_B", "Ind_A")), .Names = c("Index", "Indicators"
), row.names = c(NA, -4L), class = "data.frame")

edited May 13 '15 at 16:04

answered May 13 '15 at 15:44

Colonel Beauvel

30,423
11
47
87

An efficient way to indicate multiple indicator variables per row?

3 Answers3