
I have a data frame (about 1 million rows) that looks like this (Treatment can take many different character values; I have simplified it for the question):

ID              Position            Treatment
--20AxECvv-         0           A
--20AxECvv-         -1          A
--20AxECvv-         -2          A
--h9INKewQf-        0           A
--h9INKewQf-        -1          B
zZU7a@8jN           0           B
QUeSNEXmdB          0           C
QUeSNEXmdB          -1          C
qu72Ql@h79          0           C

I only want to keep the IDs with an exclusive treatment, in other words the IDs that received only one treatment, even if they received it several times. Then I want to count the number of IDs for each treatment. The result would be:

ID              Position            Treatment
--20AxECvv-         0           A
--20AxECvv-         -1          A
--20AxECvv-         -2          A
zZU7a@8jN           0           B
QUeSNEXmdB          0           C
QUeSNEXmdB          -1          C
qu72Ql@h79          0           C

And the counts:
A: 1
B: 1
C: 2

I have no idea how to solve this, maybe with a loop within a loop, but I am a beginner with R.

Anna Carrere
  • Your first question is answered here: [Select groups with more than one distinct value per group](https://stackoverflow.com/questions/33291658/r-subset-data-with-same-id-but-different-categories). The answer explains the use of `if(uniqueN() ) .SD, by = `. For your second question, see [Counting unique / distinct values by group in a data frame](https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame). – Henrik Aug 08 '17 at 11:17

1 Answer


We can use uniqueN to count the distinct 'Treatment' values for each 'ID' and keep only the IDs that have exactly one:

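library(data.table)
# setDT() converts df1 to a data.table by reference; rows are kept only for IDs with a single distinct Treatment
dt <- setDT(df1)[, if(uniqueN(Treatment)==1) .SD, ID]
dt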
#            ID Position Treatment
#1: --20AxECvv-        0         A
#2: --20AxECvv-       -1         A
#3: --20AxECvv-       -2         A
#4:   zZU7a@8jN        0         B
#5:  QUeSNEXmdB        0         C
#6:  QUeSNEXmdB       -1         C
#7:  qu72Ql@h79        0         C

Then we count the unique 'ID' values per 'Treatment':

dt[, .(Count = uniqueN(ID)), Treatment]
#    Treatment Count
#1:         A     1
#2:         B     1
#3:         C     2
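
For anyone who wants to run the two steps end to end, here is a minimal reproducible sketch. It assumes the sample rows from the question are typed in by hand as `df1` (the real data has about 1 million rows); the variable name `counts` is only illustrative.

```r
library(data.table)

# Sample data copied from the question (assumption: Position is numeric,
# ID and Treatment are character columns)
df1 <- data.frame(
  ID = c("--20AxECvv-", "--20AxECvv-", "--20AxECvv-",
         "--h9INKewQf-", "--h9INKewQf-", "zZU7a@8jN",
         "QUeSNEXmdB", "QUeSNEXmdB", "qu72Ql@h79"),
  Position = c(0, -1, -2, 0, -1, 0, 0, -1, 0),
  Treatment = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Step 1: keep the rows of every ID that has exactly one distinct Treatment
dt <- setDT(df1)[, if (uniqueN(Treatment) == 1) .SD, by = ID]

# Step 2: count the distinct IDs remaining for each Treatment
counts <- dt[, .(Count = uniqueN(ID)), by = Treatment]
print(counts)
```

Here '--h9INKewQf-' is dropped because it received both A and B, which leaves 7 rows in `dt`.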
akrun
  • And now, how can I get only the subset with the first position (i.e. the minimum position for each ID)? Sometimes the first position could be -2 or -65. – Anna Carrere Aug 08 '17 at 08:43
  • @AnnaCarrere In that case, `setDT(df1)[, .SD[which.min(Position)], ID]` – akrun Aug 08 '17 at 08:47
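
As a rough illustration of the `which.min` suggestion in the comment above, the sketch below applies it to the sample rows from the question. It assumes the rows are entered by hand as `df1` and that "first position" means the smallest `Position` within each ID; `first_pos` is a name made up for this example.

```r
library(data.table)

# Sample data copied from the question (assumption: Position is numeric)
df1 <- data.frame(
  ID = c("--20AxECvv-", "--20AxECvv-", "--20AxECvv-",
         "--h9INKewQf-", "--h9INKewQf-", "zZU7a@8jN",
         "QUeSNEXmdB", "QUeSNEXmdB", "qu72Ql@h79"),
  Position = c(0, -1, -2, 0, -1, 0, 0, -1, 0),
  Treatment = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# For each ID, keep the single row whose Position is the minimum;
# which.min() returns the index of the first minimum, so ties yield one row
first_pos <- setDT(df1)[, .SD[which.min(Position)], by = ID]
print(first_pos)
```

This returns one row per ID, for example the row with Position -2 for '--20AxECvv-'. To restrict it to the IDs with a single treatment, the same expression could be run on the filtered `dt` from the answer instead of on `df1`.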