select first unique observations with column value=x in R

Question

I want to identify the unique people who get an apple in a defined timeframe. I did this by creating a binary indicator "apples" as follows.

names<-c("tom", "mary", "tom", "john", "mary", "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple", "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m", "f","m","f","m","f","m", "m"))
df<-data.frame(names,dates, age, sex, fruit)
df


df$apples<-ifelse(df$fruit=='apple' & df$dates>="2010-04-01" & df$dates<"2010-10-01",1,0)
df

   names      dates age sex  fruit apples
1    tom 2010-02-01  60   m  apple      0
2   mary 2010-05-01  55   f orange      0
3    tom 2010-03-01  60   m banana      0
4   john 2010-07-01  57   m   kiwi      0
5   mary 2010-07-01  55   f  apple      1
6    tom 2010-06-01  60   m  apple      1
7   john 2010-09-01  57   m  apple      1
8   mary 2010-07-01  55   f orange      0
9   john 2010-11-01  57   m banana      0
10  mary 2010-09-01  55   f  apple      1
11   tom 2010-08-01  60   m   kiwi      0
12  mary 2010-11-01  55   f  apple      0
13  john 2010-12-01  57   m orange      0
14  john 2011-01-01  57   m  apple      0

My problem is that Mary is in there twice. I only want the first date on which she got an apple in the specified timeframe (and everyone else's first date in the real data). I would like a second column called "apples1" that flags, for each person, the first date within the defined timeframe on which they got an apple.

Desired output:

   names      dates age sex  fruit apples apples1
1    tom 2010-02-01  60   m  apple      0       0
2   mary 2010-05-01  55   f orange      0       0
3    tom 2010-03-01  60   m banana      0       0
4   john 2010-07-01  57   m   kiwi      0       0
5   mary 2010-07-01  55   f  apple      1       1
6    tom 2010-06-01  60   m  apple      1       1
7   john 2010-09-01  57   m  apple      1       1
8   mary 2010-07-01  55   f orange      0       0
9   john 2010-11-01  57   m banana      0       0
10  mary 2010-09-01  55   f  apple      1       0
11   tom 2010-08-01  60   m   kiwi      0       0
12  mary 2010-11-01  55   f  apple      0       0
13  john 2010-12-01  57   m orange      0       0
14  john 2011-01-01  57   m  apple      0       0

I've been searching, and the closest thing I found is "Select only the first rows for each unique value of a column in R", but that doesn't address unique ids. I've also come across !duplicated, but I don't want to remove Mary's data, as I need her later dates to remain so I can follow up on her. I'm probably missing something really fundamental here, apologies in advance.
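For what it's worth, `!duplicated` does not have to drop rows: it returns a logical vector, which can be used to fill a flag column instead. Below is a minimal base R sketch on the sample data, assuming the same timeframe as above. The helper names `ord`, `hit` and `first_hit` are illustrative only, and the `age` and `sex` columns are left out for brevity.

```r
# Sample data from the question (age and sex omitted for brevity).
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Indicator for an apple inside the timeframe.
df$apples <- as.integer(df$fruit == "apple" &
                        df$dates >= as.Date("2010-04-01") &
                        df$dates <  as.Date("2010-10-01"))

# Row indices sorted by person, then date; keep only the in-window apples.
ord <- order(df$names, df$dates)
hit <- ord[df$apples[ord] == 1]

# The first hit per person is the one whose name has not been seen yet.
first_hit <- hit[!duplicated(df$names[hit])]

# Flag those rows; all rows stay in the data frame in their original order.
df$apples1 <- 0L
df$apples1[first_hit] <- 1L
df
```

Because the flag is assigned by row index, nothing is removed and the original row order is kept.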

Also check [here](http://stackoverflow.com/questions/15164759/using-ifelse-with-transform-in-ddply) — Metrics, Jun 30 '13 at 03:02

Rguy · Answer 1 · 2013-06-30T02:55:31.707


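library(plyr)
# Sort by date so the earliest row for each person comes first.
df <- df[order(df$dates), ]
# Flag the first in-timeframe apple per person, using the existing `apples`
# indicator (testing fruit == "apple" alone would flag tom's 2010-02-01 apple).
ddply(df, "names", transform,
  apples1 = as.numeric(apples == 1 & !duplicated(apples))
)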

Note: this assumes ddply preserves the row order of the data frame within each group when it splits on the grouping variable. In my experience it does. If you would rather not rely on that, replace transform with an inline function that sorts each group by date before flagging, though I do not believe that is necessary.
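The inline-function variant mentioned in the note could look roughly like the sketch below. It assumes the `apples` indicator from the question. The function name `flag_first` is hypothetical, and the `age` and `sex` columns are left out for brevity.

```r
library(plyr)

# Sample data from the question (age and sex omitted for brevity).
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)
df$apples <- as.numeric(df$fruit == "apple" &
                        df$dates >= as.Date("2010-04-01") &
                        df$dates <  as.Date("2010-10-01"))

# Hypothetical helper: sort one person's rows by date, then flag the
# first row where the in-timeframe indicator equals 1.
flag_first <- function(d) {
  d <- d[order(d$dates), ]
  d$apples1 <- as.numeric(d$apples == 1 & !duplicated(d$apples))
  d
}

# Apply the helper to each person; the sort happens inside every group,
# so the result does not depend on the row order of df.
res <- ddply(df, "names", flag_first)
res
```

The result comes back grouped by name and sorted by date within each person, not in the original row order.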

edited Jun 30 '13 at 02:55

answered Jun 30 '13 at 02:50


Thank you Rguy - much appreciated. Need to do more reading on ddply and transform. – user2363642 Jun 30 '13 at 17:09

score 1 · Accepted Answer · answered Jun 30 '13 at 03:20

Here is a data.table solution that creates both columns in one step. In the output below, rows that do not match the filter keep NA in both columns rather than 0.

library(data.table)
DT <- data.table(df)
setkeyv(DT,c("names","dates"))
DT[ fruit == "apple" & 
    dates >= "2010-04-01" & 
    dates <  "2010-10-01",
    `:=`(c('apples','apples1') ,
         list(1,
         {ifelse(!duplicated(names),1,0)}))
         ]

    names      dates age sex  fruit apples apples1
 1:  john 2010-07-01  57   m   kiwi     NA      NA
 2:  john 2010-09-01  57   m  apple      1       1
 3:  john 2010-11-01  57   m banana     NA      NA
 4:  john 2010-12-01  57   m orange     NA      NA
 5:  john 2011-01-01  57   m  apple     NA      NA
 6:  mary 2010-05-01  55   f orange     NA      NA
 7:  mary 2010-07-01  55   f  apple      1       1
 8:  mary 2010-07-01  55   f orange     NA      NA
 9:  mary 2010-09-01  55   f  apple      1       0
10:  mary 2010-11-01  55   f  apple     NA      NA
11:   tom 2010-02-01  60   m  apple     NA      NA
12:   tom 2010-03-01  60   m banana     NA      NA
13:   tom 2010-06-01  60   m  apple      1       1
14:   tom 2010-08-01  60   m   kiwi     NA      NA
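If 0 is preferred over NA for the non-matching rows, as in the desired output in the question, one option is to initialise both columns before the filtered assignment. This is a sketch of that variation, not part of the original answer. It spells out the date comparison with `as.Date` and leaves out the `age` and `sex` columns for brevity.

```r
library(data.table)

# Sample data from the question (age and sex omitted for brevity).
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(df)
setkeyv(DT, c("names", "dates"))   # sorts by person, then date

# Start both flags at 0 so that unmatched rows are not left as NA.
DT[, `:=`(apples = 0, apples1 = 0)]

# Overwrite the flags only on in-timeframe apple rows; within that subset
# the first row per name is that person's earliest qualifying date.
DT[fruit == "apple" &
     dates >= as.Date("2010-04-01") &
     dates <  as.Date("2010-10-01"),
   `:=`(apples = 1, apples1 = as.numeric(!duplicated(names)))]
DT
```

As in the answer above, the rows end up ordered by the key (`names`, `dates`), not in the original row order.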

Thank you agstudy, that's great that you can do the line columns in one. – user2363642 Jun 30 '13 at 17:10
