Count variable on a Variable R

Question

Tid <- c(1,1,2,2,2,3,4,4)
Uid <- c(10,10,11,11,12,13,10,14)

Data <- data.frame(Tid,Uid)

I would like to know how many different Uid appear on every Tid. My Results should look something like this.

Tid, freqUid 
1, 1
2, 2
3, 1
4, 2

I tried to use count on it but had some issues to use it on more then just one variable.

Answers below are great. Extra care how you deal with NA values. Eg. If you have something like `Tid <- c(1,1,2,2,2,3,4,4); Uid <- c(10,10,11,NA,12,13,10,14); Data <- data.frame(Tid,Uid)` — AntoniosK, Sep 08 '15 at 12:24

David Arenburg · Accepted Answer · 2015-09-08T12:46:02.637

9

With base R

as.data.frame(table(unique(Data)$Tid))
#   Var1 Freq
# 1    1    1
# 2    2    2
# 3    3    1
# 4    4    2

Or (though the column name is less informative)

aggregate(Uid ~ Tid, unique(Data), length)
#   Tid Uid
# 1   1   1
# 2   2   2
# 3   3   1
# 4   4   2

The basic idea here is to only operate on the unique combinations of Tid/Uid and then count the different Tid instances

Edit: per @nicolas comment, we can add tapply too here as a possible solution

as.data.frame.table(tapply(Data$Uid, Data$Tid, function(x) length(unique(x))))
#   Var1 Freq
# 1    1    1
# 2    2    2
# 3    3    1
# 4    4    2

edited Sep 08 '15 at 12:46

answered Sep 08 '15 at 12:18

David Arenburg

91,361
17
137
196

1

Or also `tapply(Data$Uid,Data$Tid,function(x) length(unique(x)))`, just to add another option in `base` R. – nicola Sep 08 '15 at 12:33
1

@nicola thanks :) its the same as `table(unique(Data)$Tid)`. I'm not sure if OP will be satisfied with that output though – David Arenburg Sep 08 '15 at 12:39
Yes, and you can apply the `table` method of `as.data.frame`: `as.data.frame.table(tapply(Data$Uid,Data$Tid,function(x) length(unique(x))))` and we are there. The comment was just to let the reader aware of the `tapply` function. – nicola Sep 08 '15 at 12:44
Ok, thanks, I'll add it. – David Arenburg Sep 08 '15 at 12:45

akrun · Answer 2 · 2015-09-08T13:18:42.163

We can use n_distinct from dplyr. We group by 'Tid', and get the n_distinct for 'Uid' within summarise.

library(dplyr)
Data %>% 
   group_by(Tid) %>%
   summarise(freqUid=n_distinct(Uid))
#    Tid freqUid
#   (dbl)   (int)
#1     1       1
#2     2       2
#3     3       1
#4     4       2

Or we can use uniqueN from data.table. We convert the 'data.frame' to 'data.table' (setDT(Data)), grouped by 'Tid', we get the uniqueN of 'Uid'.

library(data.table)#v1.9.5+
setDT(Data)[, list(freqUid=uniqueN(Uid)), by = Tid]
#  Tid freqUid
#1:   1       1
#2:   2       2
#3:   3       1
#4:   4       2

Benchmarks

Here are some benchmarks using a big dataset

set.seed(24)
Data <- data.frame(Tid=rep(1:1e4, each=100),
       Uid= sample(10:70, 1e4*100, replace=TRUE))
f1 <- function() as.data.frame.table(with(Data, 
          tapply(Uid, Tid, function(.) length(unique(.)))))
f2 <- function() as.data.frame(table(unique(Data)$Tid))
f3 <- function() aggregate(Uid ~ Tid, unique(Data), length)
f4 <- function() Data %>% 
                    group_by(Tid) %>% 
                    summarise(freqUid=n_distinct(Uid))
f5 <- function() as.data.table(Data)[, list(freqUid=uniqueN(Uid)), by = Tid]

library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), f5(), times=20L, unit='relative')
#Unit: relative
# expr       min        lq      mean    median        uq       max neval cld
#f1()  2.357808  2.506813  2.347543  2.401787  2.138740  2.706053    20 a 
#f2() 10.581284 11.798583 11.456316 11.975014 11.411718 10.664648    20 b
#f3() 28.243538 27.740333 25.630334 25.042240 25.590332 23.426749    20 c
#f4()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    20 a
#f5()  1.385114  1.369170  1.396271  1.405275  1.354914  1.473114    20 a

If we remove the as.data.frame in f1 and f2 (the output format will be different), and run the benchmarks again.

  f1 <- function() with(Data, tapply(Uid, Tid, function(.) length(unique(.))))
  f2 <- function() table(unique(Data)$Tid)

and as @DavidArenburg mentioned, uniqueN is slower compared to length(unique(.)). So, replacing that in f5

 f5 <- function() as.data.table(Data)[, list(freqUid=length(unique(Uid))),
                                    by = Tid]

 microbenchmark(f1(), f2(), f3(), f4(), f5(), times=20L, unit='relative')
 #Unit: relative
 #expr       min        lq      mean    median        uq        max neval  cld
 #f1()  3.466328  3.052508  2.789366  2.968971  3.069631  1.7850643    20  b  
 #f2() 11.539920 13.372543 12.067983 13.266105 13.014644  7.6774925    20   c 
 #f3() 33.491446 30.839725 27.339148 30.888726 29.953344 17.3956850    20    d
 #f4()  1.254533  1.177933  1.083263  1.213019  1.162862  0.6981573    20 a   
 #f5()  1.000000  1.000000  1.000000  1.000000  1.000000  1.0000000    20 a

Where's is `data.table`? :( Also, probably take out `as.data.frame` from `f2 ` or add it to `f1`, not sure which. — David Arenburg, Sep 08 '15 at 12:56
I know, this is why I suggested to run it without it if we want to compare against `tapply` or to adjust `tapply` accordingly — David Arenburg, Sep 08 '15 at 13:00
@DavidArenburg Removing it will give results in a different format though. — akrun, Sep 08 '15 at 13:02
Also maybe worth running benchmark on `setDT` an a copy of the data instead of `as.data.table`. Also, try `length(unique())` instead of `uniqueN` as it slower (there is a issue on GH regarding it) — David Arenburg, Sep 08 '15 at 13:09
@DavidArenburg It could be faster, but there is a bias as the subsequent runs will be fast. — akrun, Sep 08 '15 at 13:10

talat · Answer 3 · 2015-09-08T15:00:41.517

Just to throw in another dplyr-flavored approach:

library(dplyr)
distinct(Data) %>% count(Tid)
#Source: local data frame [4 x 2]
#
#  Tid n
#1   1 1
#2   2 2
#3   3 1
#4   4 2

(Not suggesting this to be faster than other dplyr/data.table solutions.)

re @David's comment, all proposed solutions get to basically the same result. But of course, my suggestion is not identical with table(unique(Data)$Tid). It's faster and returns a data.frame (not a table object).

Colonel Beauvel · Answer 4 · 2015-09-08T14:51:45.237

2

Another possibility:

library(functional)
by(Uid, Tid, FUN=Compose(unique, length))

or base R as @David Arenburg underlined:

by(Uid, Tid, FUN=function(x) length(unique(x)))

edited Sep 08 '15 at 14:51

answered Sep 08 '15 at 14:20

Colonel Beauvel

30,423
11
47
87

Or just `by(Uid, Tid, FUN=function(x) length(unique(x)))` :P, but +1 for the most original answer here – David Arenburg Sep 08 '15 at 14:40
well reason for `Compose` is I try to make concise and 'speaking' code (not necessarily an asset since it is subjective and I use packages..) – Colonel Beauvel Sep 08 '15 at 14:52
Maybe then explain what `Compose` does as I have no idea. I also wonder if this should be wrapped into `do.call(rbind...` – David Arenburg Sep 08 '15 at 14:56
it just chain several functions into a single one, like a mathematical operator! – Colonel Beauvel Sep 08 '15 at 15:01

Count variable on a Variable R

4 Answers4

Benchmarks

Linked

Related