
I am working on a data set that has two columns: id and date_time. Here is an example:

id  date_time
1   2016-10-29 18:01:03.0000000 +08:00
1   2016-10-29 19:34:17.0000000 +08:00
1   2016-10-30 14:08:03.0000000 +08:00
1   2016-10-30 15:55:12.0000000 +08:00
2   2016-10-31 11:32:12.0000000 +08:00
2   2016-10-31 14:59:56.0000000 +08:00
2   2016-11-01 12:49:44.0000000 +08:00
2   2016-11-01 13:55:16.0000000 +08:00
2   2016-11-01 19:18:22.0000000 +08:00
2   2016-11-01 20:40:48.0000000 +08:00
3   2016-11-01 21:19:50.0000000 +08:00
3   2016-11-02 14:20:15.0000000 +08:00
3   2016-11-02 18:52:27.0000000 +08:00
3   2016-11-02 19:39:32.0000000 +08:00
3   2016-11-03 08:55:41.0000000 +08:00

All I want to obtain is two additional columns: column 1 (occ) holds the cumulative occurrence count within each id, ordered by date and time, and column 2 (date) holds the cumulative day index within each id, as shown in the table below:

id  date_time                           occ date
1   2016-10-29 18:01:03.0000000 +08:00  1   1
1   2016-10-29 19:34:17.0000000 +08:00  2   1
1   2016-10-30 14:08:03.0000000 +08:00  3   2
1   2016-10-30 15:55:12.0000000 +08:00  4   2
2   2016-10-31 11:32:12.0000000 +08:00  1   1
2   2016-10-31 14:59:56.0000000 +08:00  2   1
2   2016-11-01 12:49:44.0000000 +08:00  3   2
2   2016-11-01 13:55:16.0000000 +08:00  4   2
2   2016-11-01 19:18:22.0000000 +08:00  5   2
2   2016-11-01 20:40:48.0000000 +08:00  6   2
3   2016-11-01 21:19:50.0000000 +08:00  1   1
3   2016-11-02 14:20:15.0000000 +08:00  2   2
3   2016-11-02 18:52:27.0000000 +08:00  3   2
3   2016-11-02 19:39:32.0000000 +08:00  4   2
3   2016-11-03 08:55:41.0000000 +08:00  5   3
  1. (Note that the +08:00 offset is redundant.) To generate column 1 (occ), I tried ave with FUN=seq_along: I first split the date and the time apart, then ordered by id, date, and time.

    Q1: Is there any way I can sort the date_time column directly?

  2. For column 2 (date), I first take a subset of the data frame, generate the index from the unique values using ave and seq_along, and then merge the two data sets in a loop.

    Q2: Is there a more efficient method to achieve the same?
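
For what it's worth, every timestamp in the sample shares the same +08:00 offset and a fixed-width ISO-8601 layout, so the raw strings already order chronologically as plain character data; a minimal sketch:

```r
# Fixed-width ISO-8601 timestamps with a common offset sort
# chronologically even as plain strings
dt <- c("2016-10-30 15:55:12.0000000 +08:00",
        "2016-10-29 18:01:03.0000000 +08:00",
        "2016-10-30 14:08:03.0000000 +08:00")
sort(dt)  # earliest first, no conversion to POSIXct needed
```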

gung - Reinstate Monica
vivek

1 Answer


It is unclear to me what format your date_time variable is in. I am assuming it is POSIXct. I have trimmed off the junk and converted it to that.

d <- read.table(text="id,  date_time
1,   2016-10-29 18:01:03.0000000 +08:00
...
3,   2016-11-03 08:55:41.0000000 +08:00", header=TRUE, sep=",")
d$date_time <- as.POSIXct(substr(as.character(d$date_time), 4, 22))
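
As an aside, the offset need not be treated as junk: base R's strptime formats can parse a numeric UTC offset with %z, once the colon is removed from "+08:00". A sketch on one of the raw strings:

```r
# Parse the offset instead of trimming it: %z expects "+0800", not "+08:00"
s  <- "2016-10-29 18:01:03.0000000 +08:00"
s2 <- sub(":([0-9]{2})$", "\\1", s)   # drop the colon inside the offset
x  <- as.POSIXct(s2, format = "%Y-%m-%d %H:%M:%OS %z", tz = "UTC")
format(x, tz = "UTC")  # "2016-10-29 10:01:03", i.e. 18:01:03 at +08:00
```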

At this point you can sort the data frame, including by dates, using ?order (see also: Understanding the order() function):

d <- d[order(d$id, d$date_time),]

With the data frame sorted, you can count up rows within each id using ?tapply. Likewise, tapply can label the unique days within each id: format each timestamp down to its calendar day, then turn the day strings into a running index with factor and as.numeric. (Note that as.Date() on a POSIXct defaults to UTC, which can shift the day; format() keeps the timestamp's displayed date.) Consider:

d$occ  <- unlist(with(d, tapply(id, id, FUN=seq_along)))
d$date <- unlist(with(d, tapply(date_time, id, FUN=function(x){ 
                                  x = format(x, "%Y-%m-%d")  # displayed day, not UTC
                                  as.numeric(factor(x, levels=unique(x))) 
                                })))
d
#    id           date_time occ date
# 1   1 2016-10-29 18:01:03   1    1
# 2   1 2016-10-29 19:34:17   2    1
# 3   1 2016-10-30 14:08:03   3    2
# 4   1 2016-10-30 15:55:12   4    2
# 5   2 2016-10-31 11:32:12   1    1
# 6   2 2016-10-31 14:59:56   2    1
# 7   2 2016-11-01 12:49:44   3    2
# 8   2 2016-11-01 13:55:16   4    2
# 9   2 2016-11-01 19:18:22   5    2
# 10  2 2016-11-01 20:40:48   6    2
# 11  3 2016-11-01 21:19:50   1    1
# 12  3 2016-11-02 14:20:15   2    2
# 13  3 2016-11-02 18:52:27   3    2
# 14  3 2016-11-02 19:39:32   4    2
# 15  3 2016-11-03 08:55:41   5    3
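
Regarding Q2, the subset/merge loop can also be avoided entirely in base R with ?ave, which computes group-wise results and writes them back in the original row order. A self-contained sketch on a small frame shaped like yours:

```r
# Build a toy frame like the question's, then derive both columns with ave()
d2 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  date_time = as.POSIXct(c("2016-10-29 18:01:03", "2016-10-29 19:34:17",
                           "2016-10-30 14:08:03", "2016-10-31 11:32:12",
                           "2016-11-01 12:49:44", "2016-11-01 13:55:16"))
)
d2 <- d2[order(d2$id, d2$date_time), ]                    # sort by id, then time
d2$occ <- ave(seq_along(d2$id), d2$id, FUN = seq_along)   # running count per id
day <- format(d2$date_time, "%Y-%m-%d")                   # printed calendar day
d2$date <- as.integer(ave(day, d2$id, FUN = function(x) match(x, unique(x))))
d2$occ   # 1 2 3 1 2 3
d2$date  # 1 1 2 1 2 2
```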
gung - Reinstate Monica