1
Tid <- c(1,1,2,2,2,3,4,4)
Uid <- c(10,10,11,11,12,13,10,14)

Data <- data.frame(Tid,Uid)

I would like to know how many different Uid appear on every Tid. My Results should look something like this.

Tid, freqUid 
1, 1
2, 2
3, 1
4, 2

I tried to use count on it but had some issues to use it on more then just one variable.

Carlo
  • 397
  • 1
  • 3
  • 14
  • Answers below are great. Extra care how you deal with NA values. Eg. If you have something like `Tid <- c(1,1,2,2,2,3,4,4); Uid <- c(10,10,11,NA,12,13,10,14); Data <- data.frame(Tid,Uid)` – AntoniosK Sep 08 '15 at 12:24

4 Answers4

9

With base R

as.data.frame(table(unique(Data)$Tid))
#   Var1 Freq
# 1    1    1
# 2    2    2
# 3    3    1
# 4    4    2

Or (though the column name is less informative)

aggregate(Uid ~ Tid, unique(Data), length)
#   Tid Uid
# 1   1   1
# 2   2   2
# 3   3   1
# 4   4   2

The basic idea here is to only operate on the unique combinations of Tid/Uid and then count the different Tid instances


Edit: per @nicolas comment, we can add tapply too here as a possible solution

as.data.frame.table(tapply(Data$Uid, Data$Tid, function(x) length(unique(x))))
#   Var1 Freq
# 1    1    1
# 2    2    2
# 3    3    1
# 4    4    2
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Or also `tapply(Data$Uid,Data$Tid,function(x) length(unique(x)))`, just to add another option in `base` R. – nicola Sep 08 '15 at 12:33
  • 1
    @nicola thanks :) its the same as `table(unique(Data)$Tid)`. I'm not sure if OP will be satisfied with that output though – David Arenburg Sep 08 '15 at 12:39
  • Yes, and you can apply the `table` method of `as.data.frame`: `as.data.frame.table(tapply(Data$Uid,Data$Tid,function(x) length(unique(x))))` and we are there. The comment was just to let the reader aware of the `tapply` function. – nicola Sep 08 '15 at 12:44
  • Ok, thanks, I'll add it. – David Arenburg Sep 08 '15 at 12:45
4

We can use n_distinct from dplyr. We group by 'Tid', and get the n_distinct for 'Uid' within summarise.

library(dplyr)
Data %>% 
   group_by(Tid) %>%
   summarise(freqUid=n_distinct(Uid))
#    Tid freqUid
#   (dbl)   (int)
#1     1       1
#2     2       2
#3     3       1
#4     4       2

Or we can use uniqueN from data.table. We convert the 'data.frame' to 'data.table' (setDT(Data)), grouped by 'Tid', we get the uniqueN of 'Uid'.

library(data.table)#v1.9.5+
setDT(Data)[, list(freqUid=uniqueN(Uid)), by = Tid]
#  Tid freqUid
#1:   1       1
#2:   2       2
#3:   3       1
#4:   4       2

Benchmarks

Here are some benchmarks using a big dataset

set.seed(24)
Data <- data.frame(Tid=rep(1:1e4, each=100),
       Uid= sample(10:70, 1e4*100, replace=TRUE))
f1 <- function() as.data.frame.table(with(Data, 
          tapply(Uid, Tid, function(.) length(unique(.)))))
f2 <- function() as.data.frame(table(unique(Data)$Tid))
f3 <- function() aggregate(Uid ~ Tid, unique(Data), length)
f4 <- function() Data %>% 
                    group_by(Tid) %>% 
                    summarise(freqUid=n_distinct(Uid))
f5 <- function() as.data.table(Data)[, list(freqUid=uniqueN(Uid)), by = Tid]

library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), f5(), times=20L, unit='relative')
#Unit: relative
# expr       min        lq      mean    median        uq       max neval cld
#f1()  2.357808  2.506813  2.347543  2.401787  2.138740  2.706053    20 a 
#f2() 10.581284 11.798583 11.456316 11.975014 11.411718 10.664648    20 b
#f3() 28.243538 27.740333 25.630334 25.042240 25.590332 23.426749    20 c
#f4()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    20 a
#f5()  1.385114  1.369170  1.396271  1.405275  1.354914  1.473114    20 a 

If we remove the as.data.frame in f1 and f2 (the output format will be different), and run the benchmarks again.

  f1 <- function() with(Data, tapply(Uid, Tid, function(.) length(unique(.))))
  f2 <- function() table(unique(Data)$Tid)

and as @DavidArenburg mentioned, uniqueN is slower compared to length(unique(.)). So, replacing that in f5

 f5 <- function() as.data.table(Data)[, list(freqUid=length(unique(Uid))),
                                    by = Tid]

 microbenchmark(f1(), f2(), f3(), f4(), f5(), times=20L, unit='relative')
 #Unit: relative
 #expr       min        lq      mean    median        uq        max neval  cld
 #f1()  3.466328  3.052508  2.789366  2.968971  3.069631  1.7850643    20  b  
 #f2() 11.539920 13.372543 12.067983 13.266105 13.014644  7.6774925    20   c 
 #f3() 33.491446 30.839725 27.339148 30.888726 29.953344 17.3956850    20    d
 #f4()  1.254533  1.177933  1.083263  1.213019  1.162862  0.6981573    20 a   
 #f5()  1.000000  1.000000  1.000000  1.000000  1.000000  1.0000000    20 a   
akrun
  • 874,273
  • 37
  • 540
  • 662
2

Just to throw in another dplyr-flavored approach:

library(dplyr)
distinct(Data) %>% count(Tid)
#Source: local data frame [4 x 2]
#
#  Tid n
#1   1 1
#2   2 2
#3   3 1
#4   4 2

(Not suggesting this to be faster than other dplyr/data.table solutions.)


re @David's comment, all proposed solutions get to basically the same result. But of course, my suggestion is not identical with table(unique(Data)$Tid). It's faster and returns a data.frame (not a table object).

talat
  • 68,970
  • 21
  • 126
  • 157
2

Another possibility:

library(functional)
by(Uid, Tid, FUN=Compose(unique, length))

or base R as @David Arenburg underlined:

by(Uid, Tid, FUN=function(x) length(unique(x)))
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87