
I'm relatively new to R, so forgive me if this seems like a dumb question. I've started to run out of ideas from other examples on how to make this work, and I was hoping someone could help guide me in the right direction to get it working.

So I'm attempting to do a count of distinct SITE_ID values per CLNCL_TRIAL_ID.

My data is actually in a dataframe (data2), but this is kind of what it looks like:

CLNCL_TRIAL_ID  SITE_ID
89794           12456
89794           12456
8613            100341
8613            30807

The idea is that my end result would be a count like 89794 = 1 and 8613 = 2.
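
For reference, a minimal data frame matching the sample above could be built like this (a hypothetical reconstruction; the real data2 presumably has more rows and columns):

# minimal reproducible version of the sample data (hypothetical; the real data2 may differ)
data2 <- data.frame(
  CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
  SITE_ID        = c(12456, 12456, 100341, 30807)
)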

Here's what I have so far:

z <- aggregate(data2$SITE_ID ~ data2$CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

and I've attempted some alternate forms

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, sum(!duplicated(data$SITE_ID)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, nlevels(factor(data2$SITE_ID)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

I keep running into the problem that, instead of grouping by CLNCL_TRIAL_ID, it is counting for the whole table, so 89794 = 3 and 8613 = 3.

Does anyone have an idea how to correct this issue? I feel like I'm overlooking something silly. Also, as a side note: I'm trying to keep this limited to base R if at all possible. If it isn't possible, no biggie.

  • table() might help. See this question/answer: http://stackoverflow.com/questions/4215154/count-unique-values-in-r – Liz Young Jul 01 '15 at 22:51

4 Answers


A couple of methods:

Data:

df <- data.frame(CLNCL_TRIAL_ID = c(89794, 89794,8613, 8613), SITE_ID = c(12456, 12456, 100341, 30807))

Base R - table:

table(df)
               SITE_ID
CLNCL_TRIAL_ID 12456 30807 100341
          8613     0     1      1
         89794     2     0      0

dplyr:

library(dplyr)
df %>% 
  group_by(CLNCL_TRIAL_ID, SITE_ID) %>%
  summarise(count = n())

  CLNCL_TRIAL_ID SITE_ID count
1           8613   30807     1
2           8613  100341     1
3          89794   12456     2

Update

To count distinct values, just use unique for base R, or distinct for dplyr:

table(unique(df))
## to group/summarise the results you can use rowSums()
rowSums(table(unique(df)))
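
With the sample data above, the rowSums() call collapses the table to one distinct-site count per trial; it should print roughly:

 8613 89794 
    2     1 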


df %>%
  distinct %>%
  group_by(CLNCL_TRIAL_ID) %>%
  summarise(count = n())

Or, more succinctly, using Marek's suggestion:

df %>% distinct %>% count(CLNCL_TRIAL_ID)
tospig
  • I think I've missed the part of the question asking for `distinct` counts. Will update – tospig Jul 01 '15 at 23:34
  • `group_by(x) %>% summarise(n=n())` could be replaced by `count(x)`, so `df %>% distinct %>% count(CLNCL_TRIAL_ID)` – Marek Jul 01 '15 at 23:50

Using functions from the dplyr package:

require(dplyr)
data2 %>%
     group_by(CLNCL_TRIAL_ID) %>%
     summarise(nd = n_distinct(SITE_ID))
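
With the sample values from the question, this gives one row per trial; the printed result (formatting varies with the dplyr version) should be roughly:

  CLNCL_TRIAL_ID nd
1           8613  2
2          89794  1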

Your original approach doesn't work because you reference the original data set inside the function. Each of the following will work:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) length(unique(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) sum(!duplicated(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) nlevels(factor(x)))
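
For the sample data, each of these should produce the per-trial distinct counts, roughly:

  CLNCL_TRIAL_ID SITE_ID
1           8613       2
2          89794       1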

Also:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, n_distinct)

if you wish to mix base R and dplyr.

Marek
  • Thank you very much Marek. You and tospig both helped quite a bit, and I have it working now. As more of a curiosity/learning question: would you happen to know why my original idea wasn't aggregating correctly? – user3776991 Jul 02 '15 at 04:01
  • @user3776991 The function passed to `aggregate` was computed on the original data (you used a `$` reference). It should be `aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) length(unique(x)))`, and similarly for your alternatives. – Marek Jul 02 '15 at 05:52

A solution with the data.table package:

require(data.table)
df <- data.table(CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
    SITE_ID = c(12456, 12456, 100341, 30807))
df[, length(unique(SITE_ID)), by = CLNCL_TRIAL_ID]

Produces

   CLNCL_TRIAL_ID V1
1:          89794  1
2:           8613  2
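
As a side note, more recent versions of data.table also ship a uniqueN() helper, so (assuming a reasonably current data.table) an equivalent sketch would be the following, where n_sites is just an arbitrary column name:

df[, .(n_sites = uniqueN(SITE_ID)), by = CLNCL_TRIAL_ID]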
PavoDive

Another way is by using the fun.aggregate argument of the dcast function (which you can find in the reshape2 package).

library(reshape2)
dcast(df, CLNCL_TRIAL_ID ~ SITE_ID, length, margins = T)
Using SITE_ID as value column: use value.var to override.
  CLNCL_TRIAL_ID 12456 30807 100341 (all)
1           8613     0     1      1     2
2          89794     2     0      0     2
3          (all)     2     1      1     4

But if you do not want a cross table you can just use tapply:

tapply(df$CLNCL_TRIAL_ID, df$SITE_ID, length)
 12456  30807 100341 
     2      1      1 
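
If you want the distinct-site count per trial with the same base tools, a tapply along these lines should also work (a sketch using the df defined earlier):

tapply(df$SITE_ID, df$CLNCL_TRIAL_ID, function(x) length(unique(x)))
 8613 89794 
    2     1 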
SabDeM