Selecting variables based on two consecutive dates

Question

I think I have a novel question that, try as I might, I have been unable to solve. I have been using this site for several months to learn R and have been able to solve all of the questions I've had up until now. I am doing a large retrospective cohort study and lets just say our sample looks something like this:

my.df <- data.frame(ID = sample(c(1,2,3), 10, replace = TRUE),
                    Date = seq(as.Date("2012-08-01"),
                               as.Date("2012-11-01"), 1)[sample(1:10, 10)], 
                    ICD = c( 401.3, 401.3, 250.02, 250.02, 110.1,
                             110.1, 250.02, 250.02, 250.02,112.1))

What I need to do is select the ID's that have a specific diagnosis (lets say 250.02) on two consecutive visits. In order to go about doing this, I used code similar to this:

my.df<-with(my.df, my.df[order(ID,(as.Date(Date))), ])

to organize the data based on date then group by ID. My next step, I think, is to either write a loop function or write a function with ddply to select out consecutive dates with the same ICD code. The first problem is I'm working on crappy computers with a VERY large data set and I'm afraid a loop function will be so memory intensive the computers will either freeze or crash. The second problem is that up until now, I have worked mostly by vectorized data to get by and my loop/function programing skills are lacking at best. Any suggestions on how to efficently solve this problem would be appreciated.

You might try one of my favorite toys: `rle(my.df$ICD == 250.02)` which will provide (logicals in this example) the locations of repeats. BTW, I would strongly recommend casting your `ICD` codes as character strings to avoid any chance of floating-point discrepancies. — Carl Witthoft, Sep 03 '13 at 14:48
@CarlWitthoft thanks for the suggestion. I wasn't familiar with rle. Sadly though my data set has >600,000 rows so looking through each isn't feasible... Point taken on the ICD codes — Jellio, Sep 03 '13 at 15:49
I'll try to get some time to post more useful code, but my thought was to do something like calculating `cumsum(rle$lengths)` and extracting the values of those cumsums at the indices of `rle$values==TRUE` to get the row indices of interest in your `my.df` . — Carl Witthoft, Sep 03 '13 at 17:03

Frank · Accepted Answer · 2013-09-03T19:46:59.747

Here's a way, using the data.table package:

require(data.table)
my.dt <- data.table(my.df)
setkey(my.dt,ID,Date)
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE]$ID

setkey sorts the data by ID and then by Date. rle(x)$lengths are the lengths of each consecutive run. by checks for the condition -- any(rle(ICD)$lengths>=2) -- within each ID. The next set of square brackets -- [V1==TRUE] -- subsets the data. You can run each part to see how it works:

my.dt[,any(rle(ICD)$lengths>=2),by=ID] # and...
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE] # and...
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE]$ID

This might also help clarify what's going on:

my.dt[,rle(ICD),by=ID]

EDIT: To subset the data, this works:

my.dt[
    my.dt[,{
        r <- rle(ICD)$lengths
        rep(r>1,r)
    },by=ID]$V1
]

You can run this in pieces, too, to see how it works.

After running both solutions, I've got to go with this as the best. It ran the fastest on the huge data set I was working with and allowed me to quickly subset the data as I needed. Thanks! to both Frank and @Henrik — Jellio, Sep 04 '13 at 20:50

Henrik · Answer 2 · 2013-09-03T20:52:44.447

2

Is this what you want?

library(plyr)
df2 <- arrange(my.df, ID, Date)

# keep ID:s with at least one run longer than 2 
df3 <- ddply(.data = df2, .variables = .(ID), subset,
      any(rle(ICD)$lengths > 1))
df3

# and possibly subset df3 further:
# for each ID and ICD in df3, keep only ICD:s with more than one registration 
df4 <- ddply(.data = df3, .variables = .(ID, ICD), subset,
             length(ICD) > 1)
df4

I suppose @Frank's data.table suggestion will be faster on large data sets though.

edited Sep 03 '13 at 20:52

answered Sep 03 '13 at 17:14

Henrik

65,555
14
143
159

+1. Eh, some people like the plyr syntax better, and it's pretty transparent how this works. `df4` gives something the OP might not want, though: 8-01 and 8-07 for ID 1, which have the same ICD, but not in adjacent dates. – Frank Sep 03 '13 at 17:22
@Frank, thanks for your comment. It seems like my `df4` is different from yours. I hope I don't have messed things up here... – Henrik Sep 03 '13 at 17:34
Oh, argh: never mind! I didn't realize the OP slipped a `sample` into his data without a `set.seed()`. If you try `set.seed(1)` before generating the data, maybe you'll see that `df4` does not always give the rows desired (since you're grouping by ID & ICD, but not by the "run" from rle, you'll get rows with the same value from a different run, potentially). – Frank Sep 03 '13 at 17:48
@Frank, good point! Thanks! Maybe my subsetting of `df3` is just confusing. I try to clarify, otherwise I might just delete that last part. – Henrik Sep 03 '13 at 18:10
@Frank and Henrik, I planned on trying both of your solutions to see which worked faster/was easier but sadly I've run into a problem with my data set. I get the error that "ICD" needs to be an atomic vector and it appears that the data given to us from t he CDW has the ICD codes not being read as numerical :-/ When I get this fixed I'll try both of your solutions and let you know what happens... Stay tuned... – Jellio Sep 03 '13 at 20:25
@Jellio Maybe a stupid question, but: Did you install `plyr` before running the code? `install.packages(c('plyr','data.table'))` I can't think of why/how `ICD` could not be atomic...actually, I've sort of forgotten what that means in R. :) – Frank Sep 03 '13 at 20:29
@Jellio, it is not critical that ICD is numeric for my script to run. Also please note that, as Frank pointed out before, that you use `sample` to create example data. You can end up with no ICD with a run length > 1 - `set.seed` may be useful to make examples reproducible. – Henrik Sep 03 '13 at 20:43
@Frank That's not a stupid question at all, I am a bit of a neophyte. No, both packages were loaded, it turns out that rle cannot handle 'factors' (they don't count as atomic vectors, who knew?). Found that out from here: http://stackoverflow.com/questions/13311457/r-data-structure-and-rle-function There was a really simple work around that i found here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231 After that, both solutions worked well. – Jellio Sep 04 '13 at 20:49
@Jellio. Nice that it worked out finally! When you wrote "ICD codes not being read as numerical" I thought they had been converted to _characters_. That's why I wrote "it is not critical that ICD is numeric". I completely forgot about (the sometimes annoying) factors... – Henrik Sep 04 '13 at 21:06

score 0 · Answer 3 · answered Sep 03 '13 at 15:21

Here's one way:

library(plyr)
my.df <- data.frame(ID=sample(c(1,2,3), 10, replace=TRUE),
                    Date=seq(as.Date("2012-08-01"),
                        as.Date("2012-11-01"), 1)[sample(1:10, 10)], 
                    ICD=c(401.3, 401.3, 250.02, 250.02, 110.1,
                        110.1, 250.02, 250.02, 250.02,112.1))
aggregation.fn <- function(df) {
    df <- arrange(df, Date)
    n <- nrow(df)
    df$consecutive.ICD.are.equal <- c(FALSE, df$ICD[2:n] == df$ICD[1:(n-1)])
    return(df)
}
my.df <- ddply(my.df, .(ID), aggregation.fn)

You can then examine subset(my.df, consecutive.ICD.are.equal & ICD == 250.02).

If your data set is very large you can make ddply run in parallel.

I gave this one a shot earlier on the full dataset and unfortunately it caused a memory allocation error due to the data set size :/ I know there are GPU work arounds but I'm on government computers that won't allow that... Thanks for the suggestion though! — Jellio, Sep 03 '13 at 20:10

Selecting variables based on two consecutive dates

3 Answers3