1

I think I have a novel question that, try as I might, I have been unable to solve. I have been using this site for several months to learn R and have been able to solve all of the questions I've had up until now. I am doing a large retrospective cohort study and lets just say our sample looks something like this:

my.df <- data.frame(ID = sample(c(1,2,3), 10, replace = TRUE),
                    Date = seq(as.Date("2012-08-01"),
                               as.Date("2012-11-01"), 1)[sample(1:10, 10)], 
                    ICD = c( 401.3, 401.3, 250.02, 250.02, 110.1,
                             110.1, 250.02, 250.02, 250.02,112.1))

What I need to do is select the ID's that have a specific diagnosis (lets say 250.02) on two consecutive visits. In order to go about doing this, I used code similar to this:

my.df<-with(my.df, my.df[order(ID,(as.Date(Date))), ])

to organize the data based on date then group by ID. My next step, I think, is to either write a loop function or write a function with ddply to select out consecutive dates with the same ICD code. The first problem is I'm working on crappy computers with a VERY large data set and I'm afraid a loop function will be so memory intensive the computers will either freeze or crash. The second problem is that up until now, I have worked mostly by vectorized data to get by and my loop/function programing skills are lacking at best. Any suggestions on how to efficently solve this problem would be appreciated.

Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
Jellio
  • 25
  • 6
  • 1
    You might try one of my favorite toys: `rle(my.df$ICD == 250.02)` which will provide (logicals in this example) the locations of repeats. BTW, I would strongly recommend casting your `ICD` codes as character strings to avoid any chance of floating-point discrepancies. – Carl Witthoft Sep 03 '13 at 14:48
  • @CarlWitthoft thanks for the suggestion. I wasn't familiar with rle. Sadly though my data set has >600,000 rows so looking through each isn't feasible... Point taken on the ICD codes – Jellio Sep 03 '13 at 15:49
  • I'll try to get some time to post more useful code, but my thought was to do something like calculating `cumsum(rle$lengths)` and extracting the values of those cumsums at the indices of `rle$values==TRUE` to get the row indices of interest in your `my.df` . – Carl Witthoft Sep 03 '13 at 17:03

3 Answers3

3

Here's a way, using the data.table package:

require(data.table)
my.dt <- data.table(my.df)
setkey(my.dt,ID,Date)
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE]$ID

setkey sorts the data by ID and then by Date. rle(x)$lengths are the lengths of each consecutive run. by checks for the condition -- any(rle(ICD)$lengths>=2) -- within each ID. The next set of square brackets -- [V1==TRUE] -- subsets the data. You can run each part to see how it works:

my.dt[,any(rle(ICD)$lengths>=2),by=ID] # and...
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE] # and...
my.dt[,any(rle(ICD)$lengths>=2),by=ID][V1==TRUE]$ID

This might also help clarify what's going on:

my.dt[,rle(ICD),by=ID]

EDIT: To subset the data, this works:

my.dt[
    my.dt[,{
        r <- rle(ICD)$lengths
        rep(r>1,r)
    },by=ID]$V1
]

You can run this in pieces, too, to see how it works.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • After running both solutions, I've got to go with this as the best. It ran the fastest on the huge data set I was working with and allowed me to quickly subset the data as I needed. Thanks! to both Frank and @Henrik – Jellio Sep 04 '13 at 20:50
2

Is this what you want?

library(plyr)
df2 <- arrange(my.df, ID, Date)

# keep ID:s with at least one run longer than 2 
df3 <- ddply(.data = df2, .variables = .(ID), subset,
      any(rle(ICD)$lengths > 1))
df3

# and possibly subset df3 further:
# for each ID and ICD in df3, keep only ICD:s with more than one registration 
df4 <- ddply(.data = df3, .variables = .(ID, ICD), subset,
             length(ICD) > 1)
df4

I suppose @Frank's data.table suggestion will be faster on large data sets though.

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • +1. Eh, some people like the plyr syntax better, and it's pretty transparent how this works. `df4` gives something the OP might not want, though: 8-01 and 8-07 for ID 1, which have the same ICD, but not in adjacent dates. – Frank Sep 03 '13 at 17:22
  • @Frank, thanks for your comment. It seems like my `df4` is different from yours. I hope I don't have messed things up here... – Henrik Sep 03 '13 at 17:34
  • Oh, argh: never mind! I didn't realize the OP slipped a `sample` into his data without a `set.seed()`. If you try `set.seed(1)` before generating the data, maybe you'll see that `df4` does not always give the rows desired (since you're grouping by ID & ICD, but not by the "run" from rle, you'll get rows with the same value from a different run, potentially). – Frank Sep 03 '13 at 17:48
  • @Frank, good point! Thanks! Maybe my subsetting of `df3` is just confusing. I try to clarify, otherwise I might just delete that last part. – Henrik Sep 03 '13 at 18:10
  • @Frank and Henrik, I planned on trying both of your solutions to see which worked faster/was easier but sadly I've run into a problem with my data set. I get the error that "ICD" needs to be an atomic vector and it appears that the data given to us from t he CDW has the ICD codes not being read as numerical :-/ When I get this fixed I'll try both of your solutions and let you know what happens... Stay tuned... – Jellio Sep 03 '13 at 20:25
  • @Jellio Maybe a stupid question, but: Did you install `plyr` before running the code? `install.packages(c('plyr','data.table'))` I can't think of why/how `ICD` could not be atomic...actually, I've sort of forgotten what that means in R. :) – Frank Sep 03 '13 at 20:29
  • @Jellio, it is not critical that ICD is numeric for my script to run. Also please note that, as Frank pointed out before, that you use `sample` to create example data. You can end up with no ICD with a run length > 1 - `set.seed` may be useful to make examples reproducible. – Henrik Sep 03 '13 at 20:43
  • @Frank That's not a stupid question at all, I am a bit of a neophyte. No, both packages were loaded, it turns out that rle cannot handle 'factors' (they don't count as atomic vectors, who knew?). Found that out from here: http://stackoverflow.com/questions/13311457/r-data-structure-and-rle-function There was a really simple work around that i found here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231 After that, both solutions worked well. – Jellio Sep 04 '13 at 20:49
  • @Jellio. Nice that it worked out finally! When you wrote "ICD codes not being read as numerical" I thought they had been converted to _characters_. That's why I wrote "it is not critical that ICD is numeric". I completely forgot about (the sometimes annoying) factors... – Henrik Sep 04 '13 at 21:06
0

Here's one way:

library(plyr)
my.df <- data.frame(ID=sample(c(1,2,3), 10, replace=TRUE),
                    Date=seq(as.Date("2012-08-01"),
                        as.Date("2012-11-01"), 1)[sample(1:10, 10)], 
                    ICD=c(401.3, 401.3, 250.02, 250.02, 110.1,
                        110.1, 250.02, 250.02, 250.02,112.1))
aggregation.fn <- function(df) {
    df <- arrange(df, Date)
    n <- nrow(df)
    df$consecutive.ICD.are.equal <- c(FALSE, df$ICD[2:n] == df$ICD[1:(n-1)])
    return(df)
}
my.df <- ddply(my.df, .(ID), aggregation.fn)

You can then examine subset(my.df, consecutive.ICD.are.equal & ICD == 250.02).

If your data set is very large you can make ddply run in parallel.

Adrian
  • 3,138
  • 2
  • 28
  • 39
  • I gave this one a shot earlier on the full dataset and unfortunately it caused a memory allocation error due to the data set size :/ I know there are GPU work arounds but I'm on government computers that won't allow that... Thanks for the suggestion though! – Jellio Sep 03 '13 at 20:10