Date difference between rows in R with certain conditions from a different column

Question

The data frame has three columns. First column is Machine name with multiple machine numbers(M1, M2..), second column is about the type of test which is test 1 and finally test date indicates when the test was performed.

Below is the data frame for reference :-

Name  Test     Test_Date 
 M1    Test1    10/16/2011
 M1    Test1    1/29/2012
 M1    Test1    1/29/2012
 M2    Test1    7/26/2011
 M2    Test1    7/26/2011
 M2    Test1    5/12/2012
 M2    Test1    5/12/2012
 M2    Test1    10/29/2013
 M3    Test1    9/28/2011
 M3    Test1    1/8/2012
 M3    Test1    9/16/2012
 M3    Test1    6/3/2013
 M3    Test1    7/11/2013
 M3    Test1    8/10/2013
 M3    Test1    9/13/2013

The idea is to create a new column named "issue"(Yes/No) which indicates if a machine undergoes two or more tests(Test1) within a 48 week-span. Looked through multiple resources for this solution, but couldn't find an appropriate one.

So what exactly would be the desired output for this sample input? It seems like every machine here as an issue. — MrFlick, May 25 '15 at 21:35
I still trying to figure out what your question is. Do you want to create a column issue, or is it a requirement? Don't you know how to do it? — Jaques, May 25 '15 at 21:42
@Jaques The desired out put is a new column named issue which will include either Yes/No or True/False based on the condition. — x1carbon, May 25 '15 at 22:11
@MrFlick The data I have showcased here is just a sample of a bigger data frame. — x1carbon, May 25 '15 at 22:13
@VikasPatil I understand this is sample data, but you should include the desired result. I don't understand exactly which rows you think should have Yes. Is this a property of the machine? (should all rows for a given machine have the same value?) Is it a property of the row? (if two tests less that 48 weeks apart, do both get Yes, just the first? just the last?) — MrFlick, May 25 '15 at 22:19
@MrFlick How did you edit my post to appropriate table format? Can you please share a link on these commands so that I can use a similar formats in future posts. — x1carbon, May 27 '15 at 18:56
@explorer There' a help link in the editing box: http://stackoverflow.com/editing-help. Basically i just formatted your data like code (indent with 4 spaces) to make it easier to import. In the future it would be better to include data in a more easily [reprodicible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) such as a a `dput()` of a data.frame or code that can be copied/pasted to create the data.frame. — MrFlick, May 27 '15 at 19:48

Saurfang · Answer 1 · 2015-05-25T23:07:50.413

I think you want something like this?

library(dplyr)
library(lubridate)

dat <- read.table(textConnection("Name  Test     Test_Date 
 M1    Test1    10/16/2011
 M1    Test1    1/29/2012
 M1    Test1    1/29/2012
 M2    Test1    7/26/2011
 M2    Test1    7/26/2011
 M2    Test1    5/12/2012
 M2    Test1    5/12/2012
 M2    Test1    10/29/2013
 M3    Test1    9/28/2011
 M3    Test1    1/8/2012
 M3    Test1    9/16/2012
 M3    Test1    6/3/2013
 M3    Test1    7/11/2013
 M3    Test1    8/10/2013
 M3    Test1    9/13/2013"), header = TRUE, stringsAsFactors = FALSE) %>%
  mutate(Test_Date = mdy(Test_Date))

has_issue <- function(dates, current, duration = weeks(8)) {
  as.period(min(abs(interval(dates[-current], dates[current])))) <= duration
}

group_by(dat, Name, Test) %>%
  do({
    dates <- .$Test_Date
    mutate(., row_id = row_number()) %>%
      rowwise() %>%
      transmute(Test_Date, issue = has_issue(dates, row_id))
  }) %>%
  ungroup

This returns

Source: local data frame [15 x 4]

Name  Test  Test_Date issue
1    M1 Test1 2011-10-16 FALSE
2    M1 Test1 2012-01-29  TRUE
3    M1 Test1 2012-01-29  TRUE
4    M2 Test1 2011-07-26  TRUE
5    M2 Test1 2011-07-26  TRUE
6    M2 Test1 2012-05-12  TRUE
7    M2 Test1 2012-05-12  TRUE
8    M2 Test1 2013-10-29 FALSE
9    M3 Test1 2011-09-28 FALSE
10   M3 Test1 2012-01-08 FALSE
11   M3 Test1 2012-09-16 FALSE
12   M3 Test1 2013-06-03  TRUE
13   M3 Test1 2013-07-11  TRUE
14   M3 Test1 2013-08-10  TRUE
15   M3 Test1 2013-09-13  TRUE

bgoldst · Answer 2 · 2015-05-26T00:05:14.940

df <- data.frame(Test=c('Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1','Test1'), Name=c('M1','M1','M1','M2','M2','M2','M2','M2','M3','M3','M3','M3','M3','M3','M3'), Test_Date=as.Date(c('10/16/2011','1/29/2012','1/29/2012','7/26/2011','7/26/2011','5/12/2012','5/12/2012','10/29/2013','9/28/2011','1/8/2012','9/16/2012','6/3/2013','7/11/2013','8/10/2013','9/13/2013'),'%m/%d/%Y') );
SPAN <- 48*7;
MINTESTS <- 2;
df$issue <- ave(as.integer(df$Test_Date),df$Name,df$Test,FUN=function(dates) apply(outer(dates,dates,`-`),1,function(diffs) if (sum(abs(diffs)<SPAN) >= MINTESTS) 'Yes' else 'No'));
df;
##     Test Name  Test_Date issue
## 1  Test1   M1 2011-10-16   Yes
## 2  Test1   M1 2012-01-29   Yes
## 3  Test1   M1 2012-01-29   Yes
## 4  Test1   M2 2011-07-26   Yes
## 5  Test1   M2 2011-07-26   Yes
## 6  Test1   M2 2012-05-12   Yes
## 7  Test1   M2 2012-05-12   Yes
## 8  Test1   M2 2013-10-29    No
## 9  Test1   M3 2011-09-28   Yes
## 10 Test1   M3 2012-01-08   Yes
## 11 Test1   M3 2012-09-16   Yes
## 12 Test1   M3 2013-06-03   Yes
## 13 Test1   M3 2013-07-11   Yes
## 14 Test1   M3 2013-08-10   Yes
## 15 Test1   M3 2013-09-13   Yes

Notes:

I coerced your date strings to Date class using as.Date(c(...),'%m/%d/%Y'), which is necessary to prepare for datewise arithmetic.
As you can see, I hard-coded SPAN (number of days surrounding a given test date that is considered to be part of its "span") and MINTESTS (minimum number of tests within the span to qualify the row for issue='Yes') as constants in the global environment.
I had to coerce the first argument to ave() to integer because otherwise ave() would automatically try to coerce the return value to Date class, which would fail, since 'Yes' and 'No' are not valid date strings. This is an annoying behavior from ave() that does not appear to be configurable. Fortunately, the input df$Test_Date does not need to be classed as Date the way I use it in FUN().
I group by both df$Name and df$Test, so each machine/test pair is treated differently, with respect to whether or not there were MINTESTS tests during SPAN days surrounding a particular test date for that machine/test.
FUN() works by computing the day difference between every single pair of dates for that machine/test pair (that's what outer(dates,dates,`-`) calculates), then, for each row in the resulting difference matrix, counts how many of those absolute differences are within SPAN, and branches on whether that count surpasses MINTESTS; if it does, 'Yes' is returned; if it does not, 'No' is returned. Thus the issue column results from the ave() call and can be assigned directly to df$issue.

Here's one way you could plot this data:

## compute a key frame: one line per machine/test
pairs <- unique(df[,c('Name','Test')]);

## precompute ticks
xtick <- seq(seq(min(df$Test_Date),by='-1 month',len=2)[2],seq(max(df$Test_Date),by='1 month',len=2)[2],'month');
yspace <- 1/(nrow(pairs)+1);
pairs$ytick <- seq(yspace,1-yspace,len=nrow(pairs));

## precompute point colors using named character vector
pointColor <- c(No='red',Yes='blue');

## draw the plot
par(mar=c(6,6,3,3)+0.1,xaxs='i',yaxs='i'); ## set global plot params
plot(NA,xlim=c(min(xtick),max(xtick)),ylim=c(0,1),axes=F,xlab='',ylab=''); ## define plot bounds
with(merge(df,pairs),points(Test_Date,ytick,col=pointColor[issue],pch=4,cex=1)); ## plot points
axis(1,xtick,strftime(xtick,'%Y-%m'),las=2); ## x-axis
axis(2,c(0,pairs$ytick,1),NA,tcl=0); ## y-axis (full extent, no tick marks)
axis(2,pairs$ytick,paste0(pairs$Name,':',pairs$Test),las=1); ## y-axis (just labels and tick marks on main lines)
title('Machine Test Coverage'); ## title

plot

@VikasPatil, one uncertainty I have is whether you want `SPAN` to be two-sided. IOW do you want (1) tests that occur both within 48 weeks *prior* to the test date *and* within 48 weeks *after* the test date, *or* (2) tests that occur within a single 48 week period surrounding the test date? My code currently assumes the former, but can be easily modified to do the latter by dividing `SPAN` by 2. — bgoldst, May 25 '15 at 22:18
This looks good, but the data I have shared is just the sample of the bigger data frame. There are about 25000 rows with different machine numbers and test dates. How do I iterate through the process? The way I am thinking - is selecting all the M1` together and checking for the condition and follow a similar pattern for other machines as well. Finding id difficult to convert this thinking into a R code. — x1carbon, May 25 '15 at 22:21
@VikasPatil, no iteration is necessary, since `ave()` automatically groups by both machine and test. I'm sure it can handle 25,000 rows. Actually, to state this another way, `ave()` internally iterates over each machine/test pair, so you don't have to do that yourself in your code. — bgoldst, May 25 '15 at 22:25
This works! Thank you so much for the notes as well. Any heads-up on how I can visually represent this column/result/output(issues column). — x1carbon, May 25 '15 at 23:27

IRTFM · Answer 3 · 2015-05-25T22:21:57.483

The first version needed some improvements, since I think it fails with cases where there are fewer than three lines per Machine. After checking for adequate numbers of dates, the second version starts with the third line and sequentially checks each succeeding date to see if the earlier two tests were both within 48weeks.

> dat <- read.table(text="Name  Test     Test_Date 
+  M1    Test1    10/16/2011
+  M1    Test1    1/29/2012
+  M1    Test1    1/29/2012
+  M2    Test1    7/26/2011
+  M2    Test1    7/26/2011
+  M2    Test1    5/12/2012
+  M2    Test1    5/12/2012
+  M2    Test1    10/29/2013
+  M3    Test1    9/28/2011
+  M3    Test1    1/8/2012
+  M3    Test1    9/16/2012
+  M3    Test1    6/3/2013
+  M3    Test1    7/11/2013
+  M3    Test1    8/10/2013
+  M3    Test1    9/13/2013", header=TRUE)
> dat$Tdate <- as.Date(dat$ Test_Date, format="%m/%d/%Y")

> dat$twoIn48wk <- with(dat, ave(as.numeric(Tdate) , Name, 
              FUN=function(x) { z=c(NA,NA); 
                            for( i in seq_along(x)[-(1:2)] ){
                                z <- c(z, (x[i]-x[i-1])<=48*7 & 
                                           (x[i]-x[i-2]) <=48*7)}
                            return(z) }) )
> dat
   Name  Test  Test_Date      Tdate twoIn48wk
1    M1 Test1 10/16/2011 2011-10-16        NA
2    M1 Test1  1/29/2012 2012-01-29        NA
3    M1 Test1  1/29/2012 2012-01-29         1
4    M2 Test1  7/26/2011 2011-07-26        NA
5    M2 Test1  7/26/2011 2011-07-26        NA
6    M2 Test1  5/12/2012 2012-05-12         1
7    M2 Test1  5/12/2012 2012-05-12         1
8    M2 Test1 10/29/2013 2013-10-29         0
9    M3 Test1  9/28/2011 2011-09-28        NA
10   M3 Test1   1/8/2012 2012-01-08        NA
11   M3 Test1  9/16/2012 2012-09-16         0
12   M3 Test1   6/3/2013 2013-06-03         0
13   M3 Test1  7/11/2013 2013-07-11         1
14   M3 Test1  8/10/2013 2013-08-10         1
15   M3 Test1  9/13/2013 2013-09-13         1

This would do the test for the edge condition:

dat$twoIn48wk <- with(dat, ave(as.numeric(Tdate) , Name, 
              FUN=function(x) { if(length(x) < 3){rep(FALSE, length(x))} else{
                                 z=c(NA,NA); 
                            for( i in seq_along(x)[-(1:2)] ){
                                z <- c(z, (x[i]-x[i-1])<=48*7 & 
                                           (x[i]-x[i-2]) <=48*7)}
                            return(z) }}) )

Date difference between rows in R with certain conditions from a different column

3 Answers3