
I cannot find a satisfying tutorial that would explain to me how to use all the possibilities of the apply functions. I'm still a newbie, but this could often come in handy and significantly simplify my code. So here's my example. I've got a data frame which looks like this:

> head(p01)
   time key dwell
1   8.13   z  0.00
3   8.13   x  1.25
5   9.38   l  0.87
7  10.25   x  0.15
9  10.40   l  1.13
11 11.53   x  0.45

get it into R:

p01 <- structure(list(time = c(8.13, 8.13, 9.38, 10.25, 10.4, 11.53), 
key = c("z", "x", "l", "x", "l", "x"), dwell = c(0, 1.25, 
0.869, 0.15, 1.13, 0.45)), .Names = c("time", "key", "dwell"), row.names = c(1L, 3L, 5L, 7L, 9L, 11L), class = "data.frame")

Now I want to count the occurrences of each letter in p01$key and print them in p01$occurences, so that the result would look like this:

    time key dwell occurences
1   8.13   z  0.00          1
3   8.13   x  1.25          3
5   9.38   l  0.87          2
7  10.25   x  0.15          3
9  10.40   l  1.13          2
11 11.53   x  0.45          3

The way I do it now is:

p01[p01$key == "l", "occurences"] <- table(p01$key)["l"]
p01[p01$key == "x", "occurences"] <- table(p01$key)["x"]
p01[p01$key == "z", "occurences"] <- table(p01$key)["z"]

...which of course is not the best solution, especially since the real data contain more possibilities in p01$key (one of 16 different letters).

On top of that I want to calculate total dwell for each letter, so what I'm doing now is:

p01[p01$key == "l", "total_dwell"] <- tapply(p01$dwell, p01$key, sum)["l"]
p01[p01$key == "x", "total_dwell"] <- tapply(p01$dwell, p01$key, sum)["x"]
p01[p01$key == "z", "total_dwell"] <- tapply(p01$dwell, p01$key, sum)["z"]

in order to get:

    time key dwell total_dwell
1   8.13   z  0.00        0.00
3   8.13   x  1.25        1.85
5   9.38   l  0.87        2.00
7  10.25   x  0.15        1.85
9  10.40   l  1.13        2.00
11 11.53   x  0.45        1.85
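(For reference, both of these per-letter blocks can be collapsed into vectorised one-liners with base R's `ave`, which applies a function within groups and returns the results aligned to the original rows; this is just a sketch of that idea, not code from the answers below:)

```r
# ave() computes FUN within each group of key and recycles the result
# back to the original row order, so no per-letter lines are needed.
p01$occurences  <- ave(p01$dwell, p01$key, FUN = length)
p01$total_dwell <- ave(p01$dwell, p01$key, FUN = sum)
```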

I've been googling and going through a couple of books for the last 6 hours. I will really appreciate an elegant solution and/or a link to some comprehensive tutorial. My solution obviously works, but it's not the first time I've had to work around a problem like this, and my script files are starting to look ridiculous!

Matt Dowle
Kuba Krukar
    I"m sure someone will write you up an answer for this, but [this](http://www.jstatsoft.org/v40/i01/paper) is a fairly comprehensive treatment for this type of task. The only omission would be the **data.table** package, probably. – joran Apr 22 '13 at 14:58
  • 1
    My attempt at describing how to convert loops to functions in general: https://github.com/hadley/devtools/wiki/Functionals – hadley Apr 23 '13 at 12:05

4 Answers


If your dataset is huge, try data.table.

library(data.table)
DT <- data.table(p01)
DT[,occurences:=.N,by=key]
DT[,total_dwell:=sum(dwell),by=key]

    time key dwell occurences total_dwell
1:  8.13   z 0.000          1       0.000
2:  8.13   x 1.250          3       1.850
3:  9.38   l 0.869          2       1.999
4: 10.25   x 0.150          3       1.850
5: 10.40   l 1.130          2       1.999
6: 11.53   x 0.450          3       1.850

The two lines of assigning by reference can be combined as follows:

DT[, `:=`(occurences = .N, total_dwell = sum(dwell)), by=key]
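(If you'd rather not copy the data frame, newer versions of data.table also provide `setDT`, which converts it by reference; a sketch, assuming a version that has `setDT`:)

```r
library(data.table)
setDT(p01)  # convert p01 to a data.table in place, no copy made
p01[, `:=`(occurences = .N, total_dwell = sum(dwell)), by = key]
```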
Arun
Roland
  • Of course you could also use `data.table` for small datasets :). But the `plyr` syntax looks easier to learn to me (note I heavily use `plyr` and no `data.table` just yet). – Paul Hiemstra Apr 22 '13 at 15:05
  • 5
    Actually, when you get used to it, data.table syntax is easier for this kind of operation. – Roland Apr 22 '13 at 15:06
  • What is easier to read is also a matter of taste probably, but `data.table` looks like an awesome package. – Paul Hiemstra Apr 22 '13 at 15:07
  • you can do both at the same time, using quoted `:=` (can't figure out how to type that in comment-space), and you should use `.N` instead of `length(time)` – eddi Apr 22 '13 at 15:20
  • The downside of data.table is that it works completely differently to most other types of objects in R, so you have to learn two ways of thinking about things: the usual R way and the data.table way. The advantage is that this allows data.table to be very fast, but the disadvantage is a higher cognitive overhead. – hadley Apr 23 '13 at 12:06

I'd use plyr:

library(plyr)
res = ddply(p01, .(key), transform, 
                           occurrences = length(key), 
                           total_dwell = sum(dwell))
res
   time key dwell occurrences total_dwell
1  9.38   l 0.869           2       1.999
2 10.40   l 1.130           2       1.999
3  8.13   x 1.250           3       1.850
4 10.25   x 0.150           3       1.850
5 11.53   x 0.450           3       1.850
6  8.13   z 0.000           1       0.000

Do note that after this the table is sorted alphabetically on key. You can use `order` to restore the sorting by time:

res[order(res$time),]
   time key dwell occurrences total_dwell
3  8.13   x 1.250           3       1.850
6  8.13   z 0.000           1       0.000
1  9.38   l 0.869           2       1.999
4 10.25   x 0.150           3       1.850
2 10.40   l 1.130           2       1.999
5 11.53   x 0.450           3       1.850
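(As the comments below note, `mutate` and `summarise` also combine well with `ddply`. For instance, `summarise` collapses to one row per key, which is handy if you only need the per-letter totals; a sketch:)

```r
library(plyr)
# summarise returns one row per group instead of one per observation
ddply(p01, .(key), summarise,
      occurrences = length(key),
      total_dwell = sum(dwell))
```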
Paul Hiemstra
  • 1
    +1 I really like these one-liners in plyr and friends. I am still learning to use these over base R. – Simon O'Hanlon Apr 22 '13 at 15:01
  • Sooo fast! you beat me! +1 ;) – Jilber Urbina Apr 22 '13 at 15:02
  • 1
    Plyr is really nice yes, but a bit slow if the data becomes big. When this is the case, `data.table` is the answer... – Paul Hiemstra Apr 22 '13 at 15:03
  • 1
    ...and just add `total_dwell = sum(dwell)` to include that column as well. – joran Apr 22 '13 at 15:07
  • Thanks so much! I've accepted this answer only because plyr requires a shorter input and together with the paper suggested by joran it should solve many of my past and future problems :) data.table looks very neat too, and perhaps even more intuitive for a beginner. But I'll give plyr a chance first :) cheers guys. – Kuba Krukar Apr 22 '13 at 15:18
  • ok, but as I said this post is more about learning how to solve similar problems in the future than a one-off solution and Wickham's paper attached describes plyr – Kuba Krukar Apr 22 '13 at 15:37
  • ime you'll get a lot more mileage out of `data.table` for your future needs – eddi Apr 22 '13 at 17:18
  • @eddi depends, if your datasets are relatively small (~100k rows) the advantage of data.table is probably going to be modest. In this case plyr solutions will either be comparably fast, or acceptably fast. But I think `data.table` is an awesome tool to learn – Paul Hiemstra Apr 22 '13 at 17:26
  • Just as a remark, also take a look at `mutate` and `summarise`. These also work really well with `ddply`. – Paul Hiemstra Apr 22 '13 at 17:27
  • @PaulHiemstra ok, so what you're saying is that for smaller datasets `data.table` is only *a little* better than `plyr` and so `plyr` is better to use for smaller datasets? :) Again, im(admittedly limited)e, `plyr` only has a few functions to offer that don't have an analog in `data.table` (e.g. `rbind.fill`), and in every other scenario it's slower and has more cumbersome syntax. – eddi Apr 22 '13 at 17:32
  • What can I say - I promise to have a look at both! :) thanks again guys. – Kuba Krukar Apr 22 '13 at 19:03
  • @eddi plyr generalises in different directions to data.table. e.g. if you want to fit a linear model to multiple subsets, then extract the coefficients, then join them back together, I don't think data.table is as helpful. Also data.table does a lot of magic behind the scenes: http://stackoverflow.com/questions/15913832 which makes it harder to predict what it will do. – hadley Apr 23 '13 at 12:18
  • @hadley thanks, after this question I posted http://stackoverflow.com/questions/16153947/when-is-plyr-better-than-data-table to understand a little better what I'm missing out on. The "feature" you point out is I believe a bug and should be fixed :) – eddi Apr 23 '13 at 12:48

I don't think you want to use apply here. How about using `table` to get the frequencies, then `match` to assign them back to your data frame:

freq <- as.data.frame( table(p01$key) )
freq
#   Var1 Freq
# 1    l    2
# 2    x    3
# 3    z    1

p01$occurences <- freq[ match(p01$key , freq[,1] ) , 2 ]
p01
#   time key dwell occurences
#1   8.13   z 0.000          1
#3   8.13   x 1.250          3
#5   9.38   l 0.869          2
#7  10.25   x 0.150          3
#9  10.40   l 1.130          2
#11 11.53   x 0.450          3

As far as I can tell, the only advantage of this method over the plyr solution is that the original ordering of your data frame is retained. I do not know whether you can specify this in the ddply function, however (probably you can!).
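(The same match-back idea extends to the dwell totals: build a named vector with `tapply` and index it by key. A sketch, relying on `key` being stored as character:)

```r
# tapply returns a named vector (names are the keys: l, x, z);
# indexing it with the key column aligns the sums to the original rows.
sums <- tapply(p01$dwell, p01$key, sum)
p01$total_dwell <- as.vector(sums[p01$key])
```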

Simon O'Hanlon
  • +1 The order is fixed quite easily by sorting after the analysis. – Paul Hiemstra Apr 22 '13 at 15:05
  • (+1) @PaulHiemstra, I think what Simon was telling is that you can't get an "unsorted" solution from plyr. But you can have both from this one. – Arun Apr 22 '13 at 15:35

You can naturally solve this problem with tapply. Note that this makes a new object, p01.summary, rather than adding columns to your object p01. Another line of code could fix that.

p01.summary = with(p01, cbind(occurences  = table(key),
                              total.dwell = tapply(dwell, key, sum)))

or

p01.summary = with(p01, do.call(rbind,tapply(dwell,key,function(KEY){
   data.frame(occurence=length(KEY),total.dwell= sum(KEY))
}) ))
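(That extra line could be a merge of the per-key summary back onto the original rows. A sketch, using the first form of p01.summary above, whose rownames are the keys:)

```r
# Turn the rownames (the keys) into a column, then merge on key:
p01 <- merge(p01, data.frame(key = rownames(p01.summary), p01.summary),
             by = "key", sort = FALSE)
```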