
I am new to R. Now I have a function as follows:

funItemAverRating = function()
{
    itemRatingNum = array(0, itemNum);
    print("begin");
    apply(input, 1, function(x)
        {
            itemId = x[2]+1;
            itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
            itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
        }
    );
}

In this function, input is an n*3 data frame, where n is ~6*(10e+7), and itemRatingNum is a vector of size ~3*(10e+5).
My question is: why is the apply function so slow (it takes nearly an hour to finish)? Also, as the function runs, it uses more and more memory. But as you can see, the variables are all defined outside the apply function. Can anybody help me?

cheng

Joris Meys
user572138
  • Hard to say without seeing what the data looks like (what is `itemAverRating`, what are the columns of `input`), but I suppose you could do it without `apply` using vectorization. E.g.: `itemRatingNum[input[[2]]+1] <- itemRatingNum[input[[2]]+1] + 1` – Marek May 17 '11 at 10:17
  • Thanks for your answer. Is there any efficiency difference between this and the apply function? – user572138 May 17 '11 at 10:19
  • Yes. Operating on vectors is much, much faster (it could take you from 1h to <1m) – Marek May 17 '11 at 10:28
  • @user572138, please change the title of your question. Apply is not slow in general, only in your particular case, mainly because you are not using it right. – mpiktas May 17 '11 at 10:34
  • @mpiktas Ok. Can you provide a scenario where apply processes large-scale data (such as in my case) efficiently? Thanks. – user572138 May 17 '11 at 11:03
  • @user572138, `apply(x,1,mean)` probably will work fine. Note that I said in general, and your example is specific, i.e. with large scale data. If the data does not fit in memory base R should be used with care. Also I am a bit confused, if you knew that `apply` is slow, why did you use it? – mpiktas May 17 '11 at 11:07
  • @mpiktas : I'd use `rowMeans()` for that, which is again a whole lot faster. Apply is about as fast as a for-loop, see also : http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – Joris Meys May 17 '11 at 11:23
  • @Joris, yes I know, I try to keep myself up to date with speed improvements in R, but the OP asked the example with `apply`. – mpiktas May 17 '11 at 11:41
  • @user572138 : edited the title to point out the real question. – Joris Meys May 17 '11 at 12:46

2 Answers


It's slow because you call high-level R functions many times.

You have to vectorize your function, meaning that most operations (like <- or +1) should be computed over all data vectors.

For example, it looks to me like itemRatingNum holds the frequencies of input[[2]] (the second column of the input data.frame), which could be replaced by:

tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
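
A minimal sketch of how the two lines above fill the frequency vector, on made-up toy data (the column names and the id range 0..4 here are assumptions, not from the question):

```r
# Toy data: second column holds item ids 0..4, so after the +1 shift
# itemRatingNum needs 5 slots.
input <- data.frame(a = 1:8,
                    id = c(0, 2, 2, 4, 1, 2, 0, 4),
                    r  = runif(8))
itemRatingNum <- numeric(5)

tb <- table(input[[2]] + 1)                  # counts per shifted id
itemRatingNum[as.integer(names(tb))] <- tb   # one vectorized assignment

itemRatingNum
# slots for ids 0..4 now hold 2 1 3 0 2
```

Note that `table()` only reports ids that actually occur, which is why the result is indexed back in by `names(tb)` rather than assigned wholesale.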
David Heffernan
Marek
  • Thanks for your answer. But if I want to do things like itemPopu = tapply(input[,3], input[,2], sum), are there any efficient solutions? I find that tapply is very slow. – user572138 May 17 '11 at 11:37
    Try `rowsum(input[[3]],input[[2]])` – Marek May 17 '11 at 12:13

Don't do that. You're following a logic that is completely not R-like. If I understand it right, you want to add to a certain itemAverRating vector a value from a third column in some input dataframe.

What itemRatingNum is doing is rather obscure. It does not end up in the global environment; it just becomes a vector filled with frequencies at the end of the loop. Since you define itemRatingNum within the function, the <<- assignment also assigns it within the local environment of the function, and it gets destroyed when the function ends.

Next, you should give your function input and get some output. Never assign to the global environment if it's not necessary. Your function is equivalent to the following - rather a whole lot faster - function, which takes input and gives output:

funItemAverRating = function(x,input){
    sums <- rowsum(input[,3],input[,2])
    sumid <- as.numeric(rownames(sums))+1
    x[sumid]+c(sums)
}

FUNCTION EDITED PER MAREK'S COMMENT

Which works like :

# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
    a1 = rep(1:10,itemNum),
    a2 = sample(9:0,itemNum*10,TRUE),
    a3 = rep(10:1,itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
 0  1  2  3  4  5  6  7  8  9 
39 65 57 36 62 33 98 62 60 38 

If I try your code, I get :

> funItemAverRating()
[1] "begin"
...
> itemAverRating
 [1] 39 65 57 36 62 33 98 62 60 38

Which is the same. If you want itemRatingNum, then just do :

> itemRatingNum <- table(input[,2])
 0  1  2  3  4  5  6  7  8  9 
 6 11 11  8 10  6 18  9 13  8 
Joris Meys
  • I tried tapply, but I found that this function is very slow: itemPopu = tapply(input[,3], input[,2], sum); this code costs a lot of time. Are there any better solutions? – user572138 May 17 '11 at 11:34
  • @user572138 : It's about 13 times faster than your code on my computer and it does exactly the same thing. What do you mean by "slow"? – Joris Meys May 17 '11 at 11:39
  • In my data, the length of input is very large (~6*10e+7), but there are many repeated items in input[,2]. The unique number of input[,2] is ~3*10e5. When I run tapply(input[,3], input[,2], sum) I need to wait a long time (at least 5 mins). In C, this of course would not take so long. – user572138 May 17 '11 at 11:46
  • By "at least 5 mins" I mean that after 5 mins, the code still runs. – user572138 May 17 '11 at 11:48
  • The row number of input is about (~6*10e+7). – user572138 May 17 '11 at 11:50
  • @user572138 : What do you expect? You have a huge dataset, and you expect it to be processed as if you have only 1000 datapoints? Quite impossible. I tried it with 1e7 length input and 1e5 different items, and it runs in 11 seconds (compared to 315 seconds for your code). But at one point, you'll pay for the huge dataset you have. – Joris Meys May 17 '11 at 11:57
  • @Joris You forgot about `rowsum` ;) – Marek May 17 '11 at 12:20
  • @user572138 : edited the function using rowsum (thx @Marek), now it runs on an input with 1e7 rows and 1e5 different items in only 2.3 seconds (instead of 11) on my computer. – Joris Meys May 17 '11 at 12:26
  • @user572138 : corrected a mistake in the code. Function as it is now works perfectly for every case. – Joris Meys May 17 '11 at 12:59
  • Thanks all of you for the great help. – user572138 May 17 '11 at 16:49
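
The speed gap discussed in these comments can be checked with a small benchmark sketch. The sizes below are scaled down from the question's data and are made-up test values, so the absolute timings will differ from the figures quoted above:

```r
# Compare tapply() and rowsum() for grouped sums on integer-keyed data.
# rowsum() groups and sums in compiled code, so it is usually much faster.
set.seed(1)
n <- 1e6          # rows (question had ~6e8)
k <- 1e4          # distinct ids (question had ~3e6)
id  <- sample.int(k, n, replace = TRUE)
val <- runif(n)

t1 <- system.time(s1 <- tapply(val, id, sum))   # slower: builds a factor, loops
t2 <- system.time(s2 <- rowsum(val, id))        # faster: single grouped sum

# Both return sums in the same (sorted-id) order, so the results agree:
all.equal(unname(s1), unname(c(s2)))
```

Comparing `t1["elapsed"]` and `t2["elapsed"]` on your own machine shows the ratio; the exact speedup depends on n, k, and R version.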