2

My data is a set from big data. Each unique ID includes one or more class, and each class includes one or more unique values of X. BUT, we may have same class in different IDs (i.e, ID009 and ID020 have same class) I'm trying to find out how many unique values of X per class value based on different ID.

ID <- c("ID004", "ID004", "ID004", "ID004", "ID004", "ID004", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID006", "ID009", "ID009", "ID009", "ID009", "ID009", "ID009", "ID009","ID020", "ID020", "ID020", "ID020", "ID020", "ID020", "ID020", "ID020", "ID023", "ID023", "ID023", "ID023", "ID023", "ID023","ID023", "ID023", "ID023", "ID023")
Class <- c("CMP-001", "CMP-001", "CMP-001", "CMP-001", "CMP-001","CMP-001", "CMP-001", "CMP-001", "CMP-001", "CMP-001", "CMP-001","CMP-001", "CMP-001", "CMP-001", "CMP-002", "CMP-002", "CMP-002","CMP-002", "CMP-002", "CMP-005", "CMP-005", "CMP-005", "CMP-002", "CMP-002", "CMP-002", "CMP-002", "CMP-002","CMP-002", "CMP-002", "CMP-002", "CMP-002", "CMP-002", "CMP-002", "CMP-002","CMP-002", "CMP-004", "CMP-004", "CMP-001", "CMP-001", "CMP-001", "CMP-001", "CMP-001","CMP-001", "CMP-001", "CMP-001", "CMP-001", "CMP-001")
X <- c(1,1,2,3,3,3,4,4,4,4,4,4,4,4,5,5,6,6,6,7,7,8,9,9,10,10,10,10,10,11,11,12,12,13,13,14,14,15,15,15,16,16,17,17,18,18,18)
data <- data.frame(ID, Class, X)

the result should be;

ID         class       No. of X value
ID004     CMP-001           3
ID006     CMP-001           1
          CMP-002           2
          CMP-005           2
ID009     CMP-002           2
ID020     CMP-002           3
          CMP-004           1
ID023     CMP-001           4

Thank you for your help,

Me28
  • 107
  • 7

3 Answers3

2

Here, n_distinct would be useful after doing a grouping with 'ID', 'Class'

library(dplyr)
data %>% 
   group_by(ID, Class) %>%
   summarise(No_X_value = n_distinct(X), .groups = 'drop')

-output

# A tibble: 8 x 3
#  ID    Class   No_X_value
#  <chr> <chr>        <int>
#1 ID004 CMP-001          3
#2 ID006 CMP-001          1
#3 ID006 CMP-002          2
#4 ID006 CMP-005          2
#5 ID009 CMP-002          2
#6 ID020 CMP-002          3
#7 ID020 CMP-004          1
#8 ID023 CMP-001          4

Or with data.table

library(data.table)
setDT(data)[, .(No_X_value = uniqueN(X), .(ID, Class)]

Or using base R with aggregate

aggregate(X ~ ., unique(data), FUN = length)
#     ID   Class X
#1 ID004 CMP-001 3
#2 ID006 CMP-001 1
#3 ID023 CMP-001 4
#4 ID006 CMP-002 2
#5 ID009 CMP-002 2
#6 ID020 CMP-002 3
#7 ID020 CMP-004 1
#8 ID006 CMP-005 2
akrun
  • 874,273
  • 37
  • 540
  • 662
1

You can also use a base R approach like this with aggregate():

#Code
df <- aggregate(X~ID+Class,data,function(x) length(unique(x)))

Output:

     ID   Class X
1 ID004 CMP-001 3
2 ID006 CMP-001 1
3 ID023 CMP-001 4
4 ID006 CMP-002 2
5 ID009 CMP-002 2
6 ID020 CMP-002 3
7 ID020 CMP-004 1
8 ID006 CMP-005 2
Duck
  • 39,058
  • 13
  • 42
  • 84
1

If you are with base R, I think the aggregate approach by @akrun is already a super efficient one. Below is another option but more complicated

subset(as.data.frame(xtabs(cnt ~ ID + Class,cbind(cnt = 1,unique(data)))),Freq >0)

which gives

      ID   Class Freq
1  ID004 CMP-001    3
2  ID006 CMP-001    1
5  ID023 CMP-001    4
7  ID006 CMP-002    2
8  ID009 CMP-002    2
9  ID020 CMP-002    3
14 ID020 CMP-004    1
17 ID006 CMP-005    2
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81