1

I have a large dataset of observations, with several observations in rows and several different variables for each ID. e.g.

Data

ID   V1   V2   V3   time
1    35   100  5.2  2015-07-03 07:49
2    25   111  6.2  2015-04-01 11:52
3    41   120  NA   2015-04-01 14:17
1    25   NA   NA   2015-07-03 07:51 
2    NA   122  6.2  2015-04-01 11:50
3    40   110  4.1  2015-04-01 14:25

I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).

The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time

tapply(Data, ID, min(time)

but get an error saying

"Error in match.fun(FUN) : 'min(Data$time)' is not a function, character or symbol.

I suspect that there is also a problem because many of the rows of observations have missing data.

Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:

firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)

From the example dataset, what I would like to see is:

Data
    ID   V1   V2   V3   
    1    35   100  5.2
    2    25   122  6.2
    3    41   120  4.1  

Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.


Input data

structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L, 
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L), 
    V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140, 
    1427885520, 1427894220, 1435906260, 1427885400, 1427894700
    ), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID", 
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
user20650
  • 24,654
  • 5
  • 56
  • 91
DLane
  • 23
  • 4
  • 1
    Possible duplicate of [Extract row corresponding to minimum value of a variable by group](http://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group) and http://stackoverflow.com/questions/4189807/only-keep-min-value-for-each-factor-level – user20650 Jan 21 '17 at 00:40
  • I am looking for the minimum time for each variable, not the row with the minimum time – DLane Jan 24 '17 at 14:01
  • I took *like to extract the earliest (first) observation* as wanting the row of the data that contained the earliest time for each ID; that is what the links answer. However, if you just want the earliest time for each ID (ie not the full row of data) you could use `aggregate(time ~ ID, dat, min)` (although you could also use the answers in the link and subset). If this is not what you want , please edit your question with the expected result from your small example dataset. – user20650 Jan 24 '17 at 14:09
  • oops sorry, I just noticed your edit. It's a bit unclear what your output would be, so can you what you expect from your data. cheers – user20650 Jan 24 '17 at 14:33
  • 1
    @user20650 Thank you again for your help. However the aggregate code you provided works only to identify what the first time is, I need the value within the column associated with this time. But then if the value is NA, I need to look for the next observation in time and extract that. – DLane Jan 25 '17 at 18:09
  • Seems a strategy would be to order your data by id and time, then use `zoo::na.locf` to fill in the missing. Then you just need to extract the first row by group. Ive tried it with data.table (but i suck at it).. `DT <- setDT(dat)[order(ID, time)] ; DT[,(c("V1", "V2", "V3")):=lapply(.SD, zoo::na.locf, fromLast=TRUE), by=ID] ; DT[, .SD[1], by="ID"]` – user20650 Jan 25 '17 at 18:33
  • Voted to reopen as its slightly different from the duplicate questions – user20650 Jan 25 '17 at 18:33

1 Answers1

0

This should do what you need.

library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
              cbind(2,25,111,6.2,"2015-04-01 11:52"), 
              cbind(3,41,120,4.2,"2015-04-01 14:17"),
              cbind(1,25,NA,NA,"2015-07-03 07:51"),
              cbind(2,NA,122,6.2,"2015-04-01 11:50"),
              cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data

The outcome will be

   ID V1  V2  V3                time
1:  1 35 100 5.2 2015-07-03 07:49:00
2:  2 NA 122 6.2 2015-04-01 11:50:00
3:  3 41 120 4.2 2015-04-01 14:17:00

Let me know if this is what you were looking for, because there is a little ambiguity in your question.

  • Thank you for your reply, and pointing out the similar article! I was able to recreate the code you provided in my data successfully, however there were two problems. First, a number of duplicate records by ID came up (i.e. two rows with the same ID). And more importantly, I would like first value that is present, ignoring NAs. So in the output above, ID2 and V1 I would want 100 not NA even though it is from another row. – DLane Jan 22 '17 at 14:29
  • Here was the final code that worked: library(data.table) library(zoo) DT<- data.table(Data) #Import to Data.table format class(DT[,time]) DT[,time:=as.POSIXct(time)] #Set to POSIXct DT <- setDT(DT)[order(ID, time)] #order based on ID then time DT <- DT[,lapply(.SD, function(x) na.locf(x, fromLast=T)), by=ID] #Carry later observations earlier if NA exists DT <- DT[,.SD[1], by=ID] #extract first row of observations – DLane Jan 27 '17 at 18:19