0

I am trying to filter out Client Id's from a data frame that appear within the first three months of my dataset, but DO NOT appear after the end of the first three months, leaving me with the Client Id's that appear in both before and after the first three months. I have included some code to create a mock dataset for illustration:-

    ClientId<-c('hgjj156','jksu990','ddks989','fghs676','shjk992','hddq141','huui667','kili1772','djjp8998','hdyy1122','fghs676','shjk992','hgjj156','jksu990')

    DateStamp<-c('01-01-2015', '01-01-2015', '03-01-2015', '10-01-2015', '22-01-2015', '29-01-2015','05-02-2015','11-02-2015', '19-02-2015', '17-03-2015', '02-04-2015', '06-04-2015', '08-04-2015', '09-04-2015')

    df<-cbind(ClientId, DateStamp)
    df

Which should give you this:-

  ClientId   DateStamp   
 "hgjj156"  "01-01-2015"
 "jksu990"  "01-01-2015"
 "ddks989"  "03-01-2015"
 "fghs676"  "10-01-2015"
 "shjk992"  "22-01-2015"
 "hddq141"  "29-01-2015"
 "huui667"  "05-02-2015"
 "kili1772" "11-02-2015"
 "djjp8998" "19-02-2015"
 "hdyy1122" "17-03-2015"
 "fghs676"  "02-04-2015"
 "shjk992"  "06-04-2015"
 "hgjj156"  "08-04-2015"
 "jksu990"  "09-04-2015"

The idea is that I would be left with the following ID's:-

    ClientId   DateStamp
  "hgjj156"  "01-01-2015"
  "jksu990"  "01-01-2015"
  "fghs676"  "10-01-2015"
  "shjk992"  "22-01-2015"
  "fghs676"  "02-04-2015"
  "shjk992"  "06-04-2015"
  "hgjj156"  "08-04-2015"
  "jksu990"  "09-04-2015"

Is there any idea as to how I would achieve this? I had looked at dplyr and data.table solutions but so far I haven't found which ones would be the most appropriate.

Daniel Widdis
  • 8,424
  • 13
  • 41
  • 63
metaltoaster
  • 380
  • 2
  • 15
  • That third row client is left out because that client does not appear after the first three months. I hope that clarifies things :) – metaltoaster Oct 22 '18 at 15:57
  • How are you defining the filter condition? What do you mean `"leaving me with the Client Id's that appear in both before and after the first three months."` – SmitM Oct 22 '18 at 16:00
  • Essentially, I want to filter the df so that it removes the Client Id's that appear between 01-01-2015 and 31-03-2015, but do not appear after the 31-03-2015. The only Client Id's that should be left are those who appear before and after the 31-01-2015. – metaltoaster Oct 22 '18 at 16:10
  • You are not making sense....`I want to filter the df so that it removes the Client Id's that appear between 01-01-2015 and 31-03-2015, but do not appear after the 31-03-2015` -> This would remove all dates in Jan, Feb, Mar 2015. But then you are saying: `The only Client Id's that should be left are those who appear before and after the 31-01-2015.` -> That is every date except 31st Jan 2015` – SmitM Oct 22 '18 at 16:16

1 Answers1

0

leaving me with the Client Id's that appear in both before and after the first three months

library(data.table)

# formatting
DT = as.data.table(df)
DT[, DateStamp := as.IDate(DateStamp, "%d-%m-%Y")]

# set your thresholds
d_rng = range(DT$DateStamp)
d_dn = seq(d_rng[1], by="+3 months", length.out=2)[2]
d_up = d_dn

# find ids in each window
c_dn = DT[DateStamp < d_dn, unique(ClientId)]
c_up = DT[DateStamp >= d_up, unique(ClientId)]

# filter
DT[ClientId %in% intersect(c_dn, c_up)]

   ClientId  DateStamp
1:  hgjj156 2015-01-01
2:  jksu990 2015-01-01
3:  fghs676 2015-01-10
4:  shjk992 2015-01-22
5:  fghs676 2015-04-02
6:  shjk992 2015-04-06
7:  hgjj156 2015-04-08
8:  jksu990 2015-04-09

I'm borrowing from @GGrothendieck's answer to add/remove months.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Hi and thank you for your answer! I tried this and it didnt work at first, but oddly after trying it again it worked perfectly (so odd) but thank you! By that logic, say if I wanted to do the say but for the LAST three months, should I use the same code but change to "-3 months" ? thanks once again! – metaltoaster Oct 26 '18 at 12:52
  • @RobinTurkington Yes, I think so, like `seq(rng[2], by="-3 months", length.out=2)[2]` – Frank Oct 26 '18 at 14:27