I have these nested loops:

xall = data.frame()
for (k in 1:nrow(VectClasses)) {
  for (i in 1:nrow(VectIndVar)) {
    xall[i, k] = sum(VectClasses[k, ] == VectIndVar[i, ])
  }
}

The data:

VectClasses = data frame containing the characteristics of each class

VectIndVar = data frame containing each record of the database

The two for loops work and give an output I can work with. However, they take too long, which is why I want to use the apply family.

The output I am looking for looks like this:

    V1 V2 V3 V4
 1  3  3  2  2
 2  2  2  1  1
 3  3  4  3  3
 4  3  4  3  3
 5  4  4  3  3
 6  3  2  3  3

I tried using:

xball = data.frame()
xball = sapply(xball, function (i,k){
 sum(VectClasses[k,] == VectIndVar[i,])})

xcall = data.frame()
xcall = lapply(xcall, function (i, k){sum(VectClasses[k,] == VectIndVar[i,])})

but neither of them fills the data frame.

Reproducible data (shortened):

VectIndVar <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
                         b = floor(runif(100) * 25),
                         c = sample(c(1:5), 100, replace = TRUE),
                         d = sample(c(1:2), 100, replace = TRUE))

and:

K1 = 4
VectClasses = VectIndVar[sample(1:nrow(VectIndVar), K1, replace = FALSE), ]

Can you help me?

Amandine FAURILLOU
  • Please post the sample data. – user227710 May 26 '15 at 13:56
  • I added the data, and fixed the expression at the end – Amandine FAURILLOU May 26 '15 at 14:07
  • If you pre-allocate your output object (I would use a matrix), your loops would probably run significantly faster: `xall <- matrix(NA, ncol = nrow(VectClasses), nrow = nrow(VectIndVar))` and then run the loops as you have them (without the `xall = data.frame()` line) – rawr May 26 '15 at 14:19
  • Better than an illustration of sample data is a reproducible example, where the output you want matches the input you provide. Here's a reference: http://stackoverflow.com/a/28481250/1191259 This is particularly important if you want to ask about speed/performance. Anyway, just fyi. – Frank May 26 '15 at 14:23
  • but depending on the size of your data, `==` is probably the bottleneck – rawr May 26 '15 at 14:27
  • Thanks for adding data we can use. I have a couple further nitpicks, though: The data is not "reproducible" in a certain sense because we will each get different random results when running the code. The solution for that is the `set.seed` function. Also, under "the output I am looking for", it would be nice if that also corresponded to the example data. – Frank May 26 '15 at 16:53
  • @rawr if I used `identical` instead of the `==` statement, would that unjam the bottleneck, so to speak? – Amandine FAURILLOU May 26 '15 at 17:10
  • @Didine34790 I actually tested that before I posted, and it seems to be a bit slower. I suppose the real speed hog is that you're testing every element of every vector for exact equality. – rawr May 26 '15 at 17:15
  • @rawr in the same spirit, would it be faster if I tested whether they were different, i.e. not exact or even remote equality? – Amandine FAURILLOU May 26 '15 at 17:54
  • @Didine34790 what about this? I used your example data, with n=100,000 instead of 100. `ivar <- VectIndVar[rep(1:nrow(VectIndVar), nrow(VectClasses)), ]; vclass <- VectClasses[rep(1:nrow(VectClasses), each = nrow(ivar) / 4), ]; matrix(rowSums(vclass == ivar), ncol = nrow(VectClasses))` This ran in 1.5 seconds on my laptop. The `outer` solution is still running, and I don't want to try the for loop :} Edit: `outer` took 4.5 minutes – rawr May 26 '15 at 20:00
  • @rawr Thank you. I don't understand how it works (why /4?), but it does, and it's really fast, faster than Frank's solution. For `n=500000`, `system.time` returns 93.88 for the loop, 1.39 for Frank's solution, and 0.59 for yours – Amandine FAURILLOU May 27 '15 at 07:06
  • @rawr does the `each = nrow(ivar) / 4` refer to the fact that `K1 = 4`? – Amandine FAURILLOU May 27 '15 at 08:34
  • Yes, I suppose I should have used `K1` or `nrow(VectClasses)`, which would have made more sense. I was just trying to create two new data frames dynamically from your starting point, making sure the rows were ordered so that the results come out in the same order as your loop. Both new data frames needed to have the same dimensions in order to use `==`. The speed-up is most likely due to summing once and using `==` once – rawr May 27 '15 at 12:53
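
The approach from the last few comments can be sketched as follows, with `nrow(VectClasses)` in place of the hard-coded `4`. This is only an illustration under assumptions: the `set.seed(1)` call and the names `n`, `K`, `xall_loop` and `xall_vec` are not from the original posts, and the loop (run here on a pre-allocated matrix) is included only as a reference to check the vectorised result against.

```r
# Example data from the question; the seed is an added assumption
# so that repeated runs give the same rows.
set.seed(1)
VectIndVar <- data.frame(
  a = sample(letters[1:5], 100, replace = TRUE),
  b = floor(runif(100) * 25),
  c = sample(1:5, 100, replace = TRUE),
  d = sample(1:2, 100, replace = TRUE)
)
K1 <- 4
VectClasses <- VectIndVar[sample(seq_len(nrow(VectIndVar)), K1, replace = FALSE), ]

n <- nrow(VectIndVar)   # number of records
K <- nrow(VectClasses)  # number of classes

# Reference: the original double loop, writing into a pre-allocated matrix
# instead of growing a data frame.
xall_loop <- matrix(NA_integer_, nrow = n, ncol = K)
for (k in seq_len(K)) {
  for (i in seq_len(n)) {
    xall_loop[i, k] <- sum(VectClasses[k, ] == VectIndVar[i, ])
  }
}

# Vectorised version: stack the records K times and repeat each class n times,
# so that row (k - 1) * n + i pairs record i with class k.
ivar   <- VectIndVar[rep(seq_len(n), times = K), ]
vclass <- VectClasses[rep(seq_len(K), each = n), ]

# One `==` over the two equally sized data frames, one rowSums, then reshape
# column-wise so that column k holds the counts for class k.
xall_vec <- matrix(rowSums(vclass == ivar), nrow = n, ncol = K)

# Both approaches should agree.
stopifnot(all(xall_loop == xall_vec))
```

The `each = n` argument plays the role of `each = nrow(ivar) / 4` in the comment: `ivar` has `n * K` rows, so dividing by `K` gives `n` again. The trade-off is memory, since both helper data frames hold `n * K` rows.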

1 Answer

I would use outer instead of *apply:

res <- outer( 
  1:nrow(VectIndVar), 
  1:nrow(VectClasses),
  Vectorize(function(i,k) sum(VectIndVar[i,-1]==VectClasses[k,-1]))
)

(Thanks to this Q&A for clarifying that Vectorize is needed.)

This gives

> head(res) # with set.seed(1) before creating the data
     [,1] [,2] [,3] [,4]
[1,]    1    1    2    1
[2,]    0    0    1    0
[3,]    0    0    0    0
[4,]    0    0    1    0
[5,]    1    0    0    1
[6,]    1    1    1    1

As for speed, I would suggest using matrices instead of data.frames:

cmat <- as.matrix(VectClasses[-1]); rownames(cmat)<-VectClasses$a
imat <- as.matrix(VectIndVar[-1]);  rownames(imat)<-VectIndVar$a
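
As a rough sketch of how the matrix version might be plugged into the same `outer` call (the `set.seed(1)` call and the names `idx` and `res_mat` are assumptions for illustration, and the row-name assignment is left out because it is not needed for the counts):

```r
# Example data from the question; the seed is an added assumption.
set.seed(1)
VectIndVar <- data.frame(
  a = sample(letters[1:5], 100, replace = TRUE),
  b = floor(runif(100) * 25),
  c = sample(1:5, 100, replace = TRUE),
  d = sample(1:2, 100, replace = TRUE)
)
K1 <- 4
idx <- sample(seq_len(nrow(VectIndVar)), K1, replace = FALSE)
VectClasses <- VectIndVar[idx, ]

# Drop the first (non-numeric) column, as in the answer, so that
# as.matrix() yields plain numeric matrices.
imat <- as.matrix(VectIndVar[-1])
cmat <- as.matrix(VectClasses[-1])

# Same outer() call as above, but indexing matrix rows,
# which avoids the overhead of data frame subsetting.
res_mat <- outer(
  seq_len(nrow(imat)),
  seq_len(nrow(cmat)),
  Vectorize(function(i, k) sum(imat[i, ] == cmat[k, ]))
)

head(res_mat)
```

Note that, like the `outer` call above, this compares only columns `b`, `c` and `d`, so each count lies between 0 and 3.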
Frank
  • I have no previous knowledge of the `outer` function, so I have no idea how it works. From the code you wrote, I get this error: `Error in match.fun(FUN) : '1:nrow(VectClasses)' is not a function, a character string or a symbol ` – Amandine FAURILLOU May 26 '15 at 14:17
  • @Didine34790 Oh, sorry. I had the arguments in the wrong order. Fixed. – Frank May 26 '15 at 14:19
  • @Didine34790 I've made another change. Does it work with the `Vectorize` version? – Frank May 26 '15 at 16:55
  • @Frank time-wise, it's 86 times faster (with matrices; it's slower with data.frames) – Amandine FAURILLOU May 27 '15 at 06:46