
I am new to R and Stack Overflow. My data frame is arranged by Name and ActivityDate, and I am trying to add a LastSaleDate column that is computed only within each group (by Name). I tried to do something with dplyr but to no avail. Your help is greatly appreciated.

Name      ActivityType      ActivityDate    SalesAmount  LastSaleDate(Desired)          
John       Email            1/1/2014        NA            NA            
John       Sale             2/1/2014        1000          NA            
John       Sale             3/1/2014        2000          2/1/2014          
John       Seminar          4/1/2014        NA            3/1/2014          
John       Webinar          5/1/2014        NA            3/1/2014          
Tom        Email            1/1/2014        NA            NA            
Tom        Sale             2/1/2015        1000          NA            
Tom        Sale             3/1/2015        2000          2/1/2015          
Tom        Seminar          4/1/2015        NA            3/1/2015          
Tom        Webinar          5/1/2015        NA            3/1/2015          
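
For anyone wanting to reproduce this, the sample table above can be rebuilt as follows (a sketch; the dates are kept as character strings exactly as shown, and `custlife` is the data-frame name used in the comments):

```r
# Reconstruct the sample data from the table above (dates as character strings)
custlife <- data.frame(
  Name         = rep(c("John", "Tom"), each = 5),
  ActivityType = rep(c("Email", "Sale", "Sale", "Seminar", "Webinar"), 2),
  ActivityDate = c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014",
                   "1/1/2014", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015"),
  SalesAmount  = rep(c(NA, 1000, 2000, NA, NA), 2),
  stringsAsFactors = FALSE
)
```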
  • Not clear how you computed the desired column? – user227710 May 18 '15 at 21:37
  • `custlife %>% group_by(Name) %>% mutate(LastSaleDate=last(ActivityDate))` based on your wording. However, it doesn't match your output. Note that `SalesForce_Name` is not shown here...that could be part of the problem. – Frank May 18 '15 at 21:37
  • Sorry, I just used SalesForce_Name as it was the original column name. I typed in Name only to simplify things. I edited the code. – gibbz00 May 18 '15 at 21:44
  • LastSaleDate is the most recent sale date within a group(name in this case). – gibbz00 May 18 '15 at 21:46
  • @Frank LastSaleDate is the most recent sale date within a group(name in this case). LastSaleDate=last(ActivityDate) does not do that. Kindly let me know. – gibbz00 May 18 '15 at 21:58
  • Oh, I see. Maybe you could provide your data in a reproducible form (like from `dput`). http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Frank May 18 '15 at 22:04
  • Your `dput` doesn't match the desired column. – Frank May 18 '15 at 22:21

1 Answer


Here's one way:

library(dplyr)
library(zoo)

custlife %>% 
  group_by(Name) %>% 
  mutate(lastsale = na.locf(lag(ifelse(ActivityType == "Sale", ActivityDate, NA)),
                            na.rm = FALSE))

which seems to match up:

Source: local data frame [10 x 6]
Groups: Name

   Name ActivityType ActivityDate SalesAmount LastSaleDate.Desired. lastsale
1  John        Email     1/1/2014          NA                    NA       NA
2  John         Sale     2/1/2014        1000                    NA       NA
3  John         Sale     3/1/2014        2000              2/1/2014 2/1/2014
4  John      Seminar     4/1/2014          NA              3/1/2014 3/1/2014
5  John      Webinar     5/1/2014          NA              3/1/2014 3/1/2014
6   Tom        Email     1/1/2014          NA                    NA       NA
7   Tom         Sale     2/1/2015        1000                    NA       NA
8   Tom         Sale     3/1/2015        2000              2/1/2015 2/1/2015
9   Tom      Seminar     4/1/2015          NA              3/1/2015 3/1/2015
10  Tom      Webinar     5/1/2015          NA              3/1/2015 3/1/2015

How it works:

  • `ifelse` keeps `ActivityDate` only on the "Sale" rows, substituting `NA` everywhere else
  • `lag` shifts those values down one row, so each row only sees sales strictly before it
  • `na.locf` from zoo fills in the `NA`s with the most recent non-`NA` value (if any)
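
To see the three steps in isolation, here is a sketch on John's rows alone (dates as character strings; `dplyr::lag` is written as its base-R equivalent so each intermediate result can be inspected):

```r
library(zoo)  # for na.locf

dates <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014")
types <- c("Email", "Sale", "Sale", "Seminar", "Webinar")

# Step 1: keep ActivityDate only on "Sale" rows, NA elsewhere
sale_dates <- ifelse(types == "Sale", dates, NA)
# NA "2/1/2014" "3/1/2014" NA NA

# Step 2: shift down one row (base-R equivalent of dplyr::lag),
# so each row only sees sales strictly before it
lagged <- c(NA, head(sale_dates, -1))
# NA NA "2/1/2014" "3/1/2014" NA

# Step 3: carry the last non-NA value forward; na.rm = FALSE keeps
# the leading NAs so the result stays the same length as the input
last_sale <- na.locf(lagged, na.rm = FALSE)
# NA NA "2/1/2014" "3/1/2014" "3/1/2014"
```

This matches the `lastsale` column shown for John in the output above.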
Frank
  • `data.table` is the same, but I won't add it because I cannot test it (since the `shift` function is in development): `setDT(custlife)[,lastsale:=na.locf(shift(ifelse(ActivityType=="Sale",ActivityDate,NA)),na.rm=FALSE),by=Name]` – Frank May 18 '15 at 22:26
  • Thank you so much Frank! However, I am having trouble applying your help to my real data. The output is showing a normal lag by 1, your customized ifelse solution is not getting reflected. I will include the subset of the data and the code I am running(based on your help). – gibbz00 May 18 '15 at 22:42
  • @user3749836 Unfortunately, your new `dput` won't read into my R session; it gives an error. By the way, it's generally best not to edit an answer into the body of your question. It makes it harder for later visitors to parse out (i) whether their problem is the same as yours and (ii) what to do. – Frank May 18 '15 at 22:52
  • 1
    Thank you again! never mind about my previous port. Your solution worked perfectly. You saved me from hours of agony. – gibbz00 May 18 '15 at 22:53
  • @user3749836 No problem :) This was an interesting question I can imagine running into myself. – Frank May 18 '15 at 22:56
  • @gibbz00 `na.locf` is "last observation carried forward" to cover `NA`s, but it doesn't know what to do with *leading* `NA`s. I have it set to keep them as `NA`s with the option `na.rm`=FALSE. Without that option on, it would simply omit those leading records, making the vector too short to fit into the data.frame. I don't really know why `na.rm` of TRUE is the default... – Frank May 19 '15 at 14:36
  • Hi Frank, I am having trouble applying your solution to the main dataframe. It worked perfectly on the subset of the dataframe. I am getting Error: incompatible types, expecting a logical vector. I looked into other posts and tried custlife<- custlife %>% group_by(SalesForce_Name) %>% mutate(LastSaleDate=na.locf(lag(as.numeric(ifelse(ActivityType=="Sale",ActivityDate,NA))),na.rm=FALSE)) but it returned all NAs. Kindly help. – gibbz00 May 19 '15 at 16:18
  • This has to do with (1) the initial class of the ActivityDate column, which is simplest if you make it character and (2) the insertion of `as.numeric` into the chain, the effect of which I can't predict...seems better not to have it there. Generally, you should investigate your problem by attempting to make a small reproducible example (say a few rows) with fake data and then use what you've learned in your full application. If you can't work out the small example, then post it as a new question, I'd say. – Frank May 19 '15 at 16:24
  • 1
    I think I found the problem. It is my data integrity issue. – gibbz00 May 19 '15 at 16:46