
I'm relatively new to R, so forgive me if this seems like a dumb question. I've started to run out of ideas from other examples on how to make this work, and I was hoping someone could help guide me in the right direction to get it working.

So I'm attempting to do a count of distinct SITE_ID values per CLNCL_TRIAL_ID.

My data is actually in a dataframe (data2), but this is kind of what it looks like:

CLNCL_TRIAL_ID  SITE_ID
89794           12456
89794           12456
8613            100341
8613            30807

The idea is that my end result would be a count like 89794 = 1 and 8613 = 2.
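
For reference, a minimal data frame matching the sample above could be built like this (a hypothetical reconstruction; the real data2 presumably has more rows and columns):

# minimal reproducible version of the sample data (hypothetical; the real data2 may differ)
data2 <- data.frame(
  CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
  SITE_ID        = c(12456, 12456, 100341, 30807)
)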

Here's what I have so far:

z <- aggregate(data2$SITE_ID ~ data2$CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

and I've attempted some alternate forms

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, sum(!duplicated(data$SITE_ID)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, nlevels(factor(data2$SITE_ID)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

I keep running into the problem that, instead of grouping by CLNCL_TRIAL_ID, it is counting for the whole table, so 89794 = 3 and 8613 = 3.

Does anyone have an idea how to correct this issue? I feel like I'm overlooking something silly. Also, as a side note: I'm trying to keep this limited to base R if at all possible. If it isn't possible, no biggie.

  • table() might help. See this question/answer: http://stackoverflow.com/questions/4215154/count-unique-values-in-r – Liz Young Jul 01 '15 at 22:51

4 Answers


A couple of methods:

Data:

df <- data.frame(CLNCL_TRIAL_ID = c(89794, 89794,8613, 8613), SITE_ID = c(12456, 12456, 100341, 30807))

Base R - table:

table(df)
               SITE_ID
CLNCL_TRIAL_ID 12456 30807 100341
          8613     0     1      1
         89794     2     0      0

dplyr:

library(dplyr)
df %>% 
  group_by(CLNCL_TRIAL_ID, SITE_ID) %>%
  summarise(count = n())

  CLNCL_TRIAL_ID SITE_ID count
1           8613   30807     1
2           8613  100341     1
3          89794   12456     2

Update

To count distinct values, just use unique for base R, or distinct for dplyr:

table(unique(df))
## to group/summarise the results you can use rowSums()
rowSums(table(unique(df)))
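
With the sample data above, the rowSums() call collapses the table to one distinct-site count per trial; it should print roughly:

 8613 89794 
    2     1 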


df %>%
  distinct %>%
  group_by(CLNCL_TRIAL_ID) %>%
  summarise(count = n())

Or, more succinctly, using Marek's suggestion:

df %>% distinct %>% count(CLNCL_TRIAL_ID)
tospig
  • I think I've missed the part of the question asking for `distinct` counts. Will update – tospig Jul 01 '15 at 23:34
  • `group_by(x) %>% summarise(n=n())` could be replaced by `count(x)`, so `df %>% distinct %>% count(CLNCL_TRIAL_ID)` – Marek Jul 01 '15 at 23:50

Using functions from the dplyr package:

require(dplyr)
data2 %>%
     group_by(CLNCL_TRIAL_ID) %>%
     summarise(nd = n_distinct(SITE_ID))
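
With the sample values from the question, this gives one row per trial; the printed result (formatting varies with the dplyr version) should be roughly:

  CLNCL_TRIAL_ID nd
1           8613  2
2          89794  1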

Your original approach doesn't work because you reference the original data set inside the function. Each of the following will work:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) length(unique(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) sum(!duplicated(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) nlevels(factor(x)))
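
For the sample data, each of these should produce the per-trial distinct counts, roughly:

  CLNCL_TRIAL_ID SITE_ID
1           8613       2
2          89794       1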

Also:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, n_distinct)

if you wish to mix base R and dplyr.

Marek
  • Thank you very much Marek. You and tospig both helped quite a bit, and I have it working now. As more of a curiosity/learning question: would you happen to know why my original idea wasn't aggregating correctly? – user3776991 Jul 02 '15 at 04:01
  • @user3776991 The function passed to `aggregate` was computed on the original data (you used a `$` reference). It should be `aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) length(unique(x)))`, and similarly for your alternatives. – Marek Jul 02 '15 at 05:52

A solution with the data.table package:

require(data.table)
df <- data.table(CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
    SITE_ID = c(12456, 12456, 100341, 30807))
df[, length(unique(SITE_ID)), by = CLNCL_TRIAL_ID]

Produces

   CLNCL_TRIAL_ID V1
1:          89794  1
2:           8613  2
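
As a side note, more recent versions of data.table also ship a uniqueN() helper, so (assuming a reasonably current data.table) an equivalent sketch would be the following, where n_sites is just an arbitrary column name:

df[, .(n_sites = uniqueN(SITE_ID)), by = CLNCL_TRIAL_ID]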
PavoDive

Another way is by using the fun.aggregate argument of the dcast function (which you can find in the reshape2 package).

library(reshape2)
dcast(df, CLNCL_TRIAL_ID ~ SITE_ID, length, margins = T)
Using SITE_ID as value column: use value.var to override.
  CLNCL_TRIAL_ID 12456 30807 100341 (all)
1           8613     0     1      1     2
2          89794     2     0      0     2
3          (all)     2     1      1     4

But if you do not want a cross table you can just use tapply:

tapply(df$CLNCL_TRIAL_ID, df$SITE_ID, length)
 12456  30807 100341 
     2      1      1 
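
If you want the distinct-site count per trial with the same base tools, a tapply along these lines should also work (a sketch using the df defined earlier):

tapply(df$SITE_ID, df$CLNCL_TRIAL_ID, function(x) length(unique(x)))
 8613 89794 
    2     1 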
SabDeM