I have a time series and would like to extract the last observation of each month. This question is not about generating a new time series, but about finding the last observation of each month in an existing one. The last observation may not fall on the last day of the month. The following is a small example:

library(lubridate)
date <- c(ymd(20010129, 20010228, 20010330, 20010429), ymd(20010501) + days(1:90))

# "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29" "2001-05-02" "2001-05-03" "2001-05-04" "2001-05-05"
# "2001-05-06" "2001-05-07" "2001-05-08" "2001-05-09" "2001-05-10" "2001-05-11" "2001-05-12" "2001-05-13"
# "2001-05-14" "2001-05-15" "2001-05-16" "2001-05-17" "2001-05-18" "2001-05-19" "2001-05-20" "2001-05-21"
# "2001-05-22" "2001-05-23" "2001-05-24" "2001-05-25" "2001-05-26" "2001-05-27" "2001-05-28" "2001-05-29"
# "2001-05-30" "2001-05-31" "2001-06-01" "2001-06-02" "2001-06-03" "2001-06-04" "2001-06-05" "2001-06-06"
# "2001-06-07" "2001-06-08" "2001-06-09" "2001-06-10" "2001-06-11" "2001-06-12" "2001-06-13" "2001-06-14"
# "2001-06-15" "2001-06-16" "2001-06-17" "2001-06-18" "2001-06-19" "2001-06-20" "2001-06-21" "2001-06-22"
# "2001-06-23" "2001-06-24" "2001-06-25" "2001-06-26" "2001-06-27" "2001-06-28" "2001-06-29" "2001-06-30"
# "2001-07-01" "2001-07-02" "2001-07-03" "2001-07-04" "2001-07-05" "2001-07-06" "2001-07-07" "2001-07-08"
# "2001-07-09" "2001-07-10" "2001-07-11" "2001-07-12" "2001-07-13" "2001-07-14" "2001-07-15" "2001-07-16"
# "2001-07-17" "2001-07-18" "2001-07-19" "2001-07-20" "2001-07-21" "2001-07-22" "2001-07-23" "2001-07-24"
# "2001-07-25" "2001-07-26" "2001-07-27" "2001-07-28" "2001-07-29" "2001-07-30"

I want to keep the observations "2001-01-29", "2001-02-28", "2001-03-30", "2001-04-29", "2001-05-31", "2001-06-30", and "2001-07-30". Is there a way to achieve this?

Gregor Thomas
morningfin

3 Answers

3

You can group the date by month and calculate the maximum value:

library(lubridate)
unique(ave(date, month(date), FUN = max))

# [1] "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29"
# [5] "2001-05-31" "2001-06-30" "2001-07-30"
Psidom
  • But my data set has observations across years, and your answer can only give 12 observations. I need the last observation of each month in every year. Thanks again! – morningfin Aug 22 '16 at 16:49
  • You can add `year(date)` as a grouping variable, something like: `unique(ave(date, month(date), year(date), FUN = max))` – Psidom Aug 22 '16 at 16:52
  • If you're going to uniqify it, I guess `tapply` makes more sense than `ave`. – Frank Aug 22 '16 at 17:01
  • @Frank `tapply` seems to drop the class of the original data, but `as.Date(tapply(date, month(date), FUN = max))` is still a good alternative. – Psidom Aug 22 '16 at 17:11
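
The year-aware grouping suggested in the comments can also be sketched in base R alone, by building a single "YYYY-MM" key with `format()` instead of grouping on month and year separately. This is only a minimal sketch; the `dates` vector below is made-up sample data spanning two years and is not from the question.

```r
# Hypothetical sample spanning two years (not from the original question)
dates <- as.Date(c("2001-01-10", "2001-01-29", "2001-02-28",
                   "2002-01-05", "2002-01-20", "2002-02-14"))

# A "YYYY-MM" key keeps January 2001 and January 2002 in separate groups
ym <- format(dates, "%Y-%m")

# ave() returns the group maximum for every element; unique() collapses repeats
last_obs <- unique(ave(dates, ym, FUN = max))
last_obs
```

With a month-only key, the two Januaries would fall into one group and only the 2002 date would survive, which is the problem raised in the first comment.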
2

We can use data.table: convert the 'date' vector to a data.table, group by the year and month of 'date', and take the max of 'date' in each group.

library(data.table)
as.data.table(date)[, .(Date = max(date)), .(Year = year(date), Month = month(date))]
#   Year Month       Date
#1: 2001     1 2001-01-29
#2: 2001     2 2001-02-28
#3: 2001     3 2001-03-30
#4: 2001     4 2001-04-29
#5: 2001     5 2001-05-31
#6: 2001     6 2001-06-30
#7: 2001     7 2001-07-30

Or, in base R, a tapply-based approach is more direct than building a vector of the same length as the original and then taking its unique values.

do.call("c", tapply(date, list(month(date), year(date)), 
                FUN = function(x) list(max(x))))
#[1] "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29" "2001-05-31" 
#[6] "2001-06-30" "2001-07-30"

Or, more concisely:

 unname(as.Date(tapply(date, substr(date, 1,7), FUN = max), origin = "1970-01-01"))
 #[1] "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29" "2001-05-31" 
 #[6] "2001-06-30" "2001-07-30"

We can also get the output without any grouping by comparing adjacent elements (assuming the dates are ordered), which should be very efficient.

v1 <- substr(date, 1, 7)
date[c(v1[-1]!= v1[-length(v1)], TRUE)]
#[1] "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29" "2001-05-31" 
#[6] "2001-06-30" "2001-07-30"
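
Because the adjacent-element comparison relies on the dates being ordered, one possible way to guard against unsorted input is to wrap it in a small helper that sorts first. The function name `last_of_month` and the sample vector `x` are hypothetical, introduced here only for illustration; the sketch uses base R's `format()` for the year-month key.

```r
# Hypothetical helper: last observed date in each calendar month
last_of_month <- function(d) {
  d <- sort(d)              # the adjacent comparison needs ordered input
  ym <- format(d, "%Y-%m")  # "YYYY-MM" key for each date
  # Keep a date when the next key differs; the final element is always kept
  d[c(ym[-1] != ym[-length(ym)], TRUE)]
}

# Made-up, deliberately unsorted input
x <- as.Date(c("2001-02-28", "2001-01-29", "2001-01-15", "2001-03-02"))
res <- last_of_month(x)
res
```

Sorting adds a cost that the original one-liner avoids, so it is worth skipping when the series is already known to be in order.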

Benchmarks

date1 <- c(ymd(20010129, 20010228, 20010330, 20010429), ymd(20010501) + days(1:1e6))
system.time(as.data.table(date1)[, .(Date = max(date1)), 
      .(Year = year(date1), Month = month(date1))])
#   user  system elapsed 
#   5.53    0.05    5.58  


system.time({
 v1 <- substr(date1, 1, 7)
 date1[c(v1[-1]!= v1[-length(v1)], TRUE)]
})
# user  system elapsed 
#  10.25    0.23   10.49 

Based on these timings, the data.table method is very efficient, and the base R comparison of adjacent elements is not far behind. The `ave`-based approach looks neat but is far slower; all that glitters is not gold:

system.time(unique(ave(date1, year(date1), month(date1), FUN = max)))
#   user  system elapsed 
# 242.35  120.80  364.55 
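
The examples so far work on a bare vector of dates. If the dates are one column of a data frame and the whole row should be kept, one option is `duplicated()` with `fromLast = TRUE` on a year-month key. This is only a sketch: the data frame `df` and its `value` column are made up for illustration, and it has not been benchmarked against the methods above.

```r
# Made-up data frame; 'value' is a hypothetical measurement column
df <- data.frame(
  date  = as.Date(c("2001-01-15", "2001-01-29", "2001-02-10",
                    "2001-02-28", "2001-03-30")),
  value = c(1.2, 3.4, 5.6, 7.8, 9.0)
)

df <- df[order(df$date), ]      # make sure rows are in date order
ym <- format(df$date, "%Y-%m")  # year-month key for each row

# Scanning from the end, the first row seen for each key is the month's last row
last_rows <- df[!duplicated(ym, fromLast = TRUE), ]
last_rows
```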
akrun
0

`endpoints`, a function from the xts package, does exactly what its name suggests:

library(xts)
date[endpoints(date, on = "months")]
# [1] "2001-01-29" "2001-02-28" "2001-03-30" "2001-04-29" "2001-05-31"
# [6] "2001-06-30" "2001-07-30"

Valid values for the argument `on` include: “us” (microseconds), “microseconds”, “ms” (milliseconds), “milliseconds”, “secs” (seconds), “seconds”, “mins” (minutes), “minutes”, “hours”, “days”, “weeks”, “months”, “quarters”, and “years”.

hvollmeier