Trouble finding non-unique index entries in zooreg time series

Question

I have several years of data that I'm trying to work into a zoo object (.csv at Dropbox). I get a warning once the data is coerced into a zoo object, but I cannot find any duplicates in the index.

df <- read.csv(choose.files(default = "", caption = "Select data source", multi = FALSE), na.strings="*")
df <- read.zoo(df, format = "%Y/%m/%d %H:%M", regular = TRUE, row.names = FALSE, col.names = TRUE, index.column = 1)
Warning message:
In zoo(rval3, ix) :
  some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique

I've tried:

sum(duplicated(df$NST_DATI))

But the result is 0.

Thanks for your help!

Including a link to your actual data was brilliant. +1. — jlhoward, Dec 08 '14 at 18:08

score 5 · Accepted Answer · edited Feb 21 '16 at 18:44

You are using read.zoo(...) incorrectly. According to the documentation:

To process the index, read.zoo calls FUN with the index as the first argument. If FUN is not specified then if there are multiple index columns they are pasted together with a space between each. Using the index column or pasted index column:

1. If tz is specified then the index column is converted to POSIXct.
2. If format is specified then the index column is converted to Date.
3. Otherwise, a heuristic attempts to decide among "numeric", "Date" and "POSIXct".

If format and/or tz is specified then they are passed to the conversion function as well.

You are specifying format=... without tz, so read.zoo(...) converts the index to Date, not POSIXct. The time of day is dropped, so every observation from the same day gets the same index value, and there are many, many duplicated dates.
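A minimal base R sketch of why the raw strings look unique while the converted index does not. The three timestamps are made up for illustration; they only mimic the layout of the index column in the question.

```r
# Hypothetical timestamps in the same layout as the question's index column
stamps <- c("2014/03/09 01:00", "2014/03/09 02:00", "2014/03/09 03:00")

# The raw strings are all distinct, so a duplicate check on them finds nothing
n_dup_strings <- sum(duplicated(stamps))

# Converting to Date keeps only the calendar day and discards the time
as_dates <- as.Date(stamps, format = "%Y/%m/%d %H:%M")
n_dup_dates <- sum(duplicated(as_dates))

n_dup_strings  # 0
n_dup_dates    # 2
```

This is the same conversion that read.zoo applies when only format is given, which is why the warning appears even though the source column has no repeated values.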

Simplistically, the correct solution is to use:

df <- read.zoo(df, FUN=as.POSIXct, format = "%Y/%m/%d %H:%M")
# Error in read.zoo(df, FUN = as.POSIXct, format = "%Y/%m/%d %H:%M") : 
#   index has bad entries at data rows: 507 9243 18147 26883 35619 44355

but as you can see this does not work either. Here the problem is much more subtle. The index is converted to POSIXct, but in the system time zone (which on my system is US Eastern). The referenced rows have timestamps that coincide with the changeover from Standard to DST, so these times do not exist in the US Eastern time zone. If you use:

df <- read.zoo(df, FUN=as.POSIXct, format = "%Y/%m/%d %H:%M", tz="UTC")

the data imports correctly.
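The DST gap can be reproduced in base R with a single timestamp. This is a sketch under assumptions: the timestamp below is a hypothetical example chosen to fall inside the 2014 US spring-forward gap, not a row taken from the dataset, and how R treats a nonexistent local time can vary by platform and R version.

```r
fmt <- "%Y/%m/%d %H:%M"

# Hypothetical timestamp: on 2014-03-09 US Eastern clocks jumped from 02:00 to 03:00
gap_stamp <- "2014/03/09 02:00"

eastern <- as.POSIXct(gap_stamp, format = fmt, tz = "America/New_York")
utc     <- as.POSIXct(gap_stamp, format = fmt, tz = "UTC")

is.na(eastern)  # commonly TRUE, but platform-dependent for nonexistent times
is.na(utc)      # FALSE: UTC has no DST, so every clock time exists
```

An NA in the converted index is what read.zoo reports as "index has bad entries".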

EDIT:

As @G.Grothendieck points out, this would also work, and is simpler:

df <- read.zoo(df, tz="UTC")

You should set tz to whatever time zone is appropriate for the dataset.
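To check the result on a small scale, here is a hedged end-to-end sketch. It assumes the zoo package is installed; the inline CSV and the column name `value` are invented for illustration, and the read.zoo call is the FUN/format/tz form shown earlier in this answer.

```r
library(zoo)

# Hypothetical sample data; only the timestamp layout matches the question
csv_text <- "NST_DATI,value
2014/03/09 01:00,1.0
2014/03/09 02:00,2.0
2014/03/09 03:00,3.0"

sample_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert the first column to POSIXct in UTC so no clock time is skipped
z <- read.zoo(sample_df, FUN = as.POSIXct, format = "%Y/%m/%d %H:%M", tz = "UTC")

anyDuplicated(index(z))  # 0 means every index entry is unique
class(index(z))          # includes "POSIXct"
```

Running anyDuplicated on index(z) after the import is a more direct check than looking for duplicates in the original character column, because it tests the values zoo actually uses as the index.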

If `tz` is specified it will know that POSIXct is wanted (as the alternatives it considers don't use time zones) so `FUN` could be omitted. — G. Grothendieck, Dec 08 '14 at 18:44
Thank you so much for your help. You have no idea how much this is appreciated. I haven't implemented your solution **just** yet, but I will within the next few days. I'll let you know how it turns out. I love this community. — Ryan Pugh, Dec 09 '14 at 15:58
