6

I am experimenting with R to analyse some measurement data. I have a .csv file containing more than 2 million lines of measurement. Here is an example:

2014-10-22 21:07:03+00:00,7432442.0
2014-10-22 21:07:21+00:00,7432443.0
2014-10-22 21:07:39+00:00,7432444.0
2014-10-22 21:07:57+00:00,7432445.0
2014-10-22 21:08:15+00:00,7432446.0
2014-10-22 21:08:33+00:00,7432447.0
2014-10-22 21:08:52+00:00,7432448.0
2014-10-22 21:09:10+00:00,7432449.0
2014-10-22 21:09:28+00:00,7432450.0

After reading in the file, I want to convert the time to a correct time, using as.POSIXct(). For small files this works fine, but for large files it does not.

I made an example by reading in a big file, creating a copy of a small portion and then unleashing the as.POSIXct() on the correct column. I included an image of the file. As you can see, when applying it to the temp-variable it does correctl keep the hours, minutes and seconds. However, when applying it to the whole file, only the date is stored. (it also takes a LOT of time (more than 2 minutes))

POSIXct() error

What could cause this anomality? Is it due to some system limits, since I'm running this on my laptop.

Edit

On my Windows 7 device I run R 3.1.3 which results in this error. However, on Ubuntu 14.01, running R 3.0.2, the times are kept for the large files. Just noticed there is a newer version (3.2.0) for Windows, will update and check if the issue persists.

jdepypere
  • 3,453
  • 6
  • 54
  • 84
  • 1
    I cannot reproduce the problem with the data you've provided. Can you find a minimally [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Also, what version of R and what OS are you using? – MrFlick May 04 '15 at 20:05

3 Answers3

8

Maybe the reason for your problem is that you have dates without time somewhere in your data set. Try the following example:

  library(lubridate)
  dates <- as.character(now() + minutes(1:5))
  dates <- c(dates,"2015-05-10")
  as.POSIXct(dates[1:5])
  as.POSIXct(dates)

It first creates a vector dates containing 6 dates with times and converts them to character. Then I add another date (as a character) that does not contain a time. When you run the two conversions to POSIXct, you'll notice that the times are gone in the result, as soon as you include the date without time.

So there seems to be no date without time in the first few rows of your data, but later there maybe will be. There are most likely many solutions for this problem and I'll just propose one that came to my mind.

The first step is to change your read command, such that the dates are stored as characters instead of factors:

data <- read.csv("C:/RData/house2_electricity_Main.csv",header=FALSE,stringsAsFactors=FALSE)

Then you can try to add the time to all the dates that have none and convert to POSIXct only afterwards:

data$V1 <- ifelse(nchar(data$V1) > 11,data$V1, paste0(data$V1,"00:00:00"))
data$V1 <- as.POSIXct(data$V1)

This worked for my little example above. It is not the most elegant solution and maybe someone has a better idea.

Stibu
  • 15,166
  • 6
  • 57
  • 71
5

I had a similar problem with as.POSIXlt(X) dropping the hour:minute:second information, with X being a vector of POSIXct objects, that happened to have tzone="UTC".

However, as.POSIXlt(X, tz="UTC") kept the hour:minute:second information.

jacefarm
  • 6,747
  • 6
  • 36
  • 46
Andrew
  • 51
  • 1
  • 1
3

You can try the code below.
It will:

  • read datetime type as character instead of factor
  • update by reference

library(data.table)
data <- fread("C:/RData/house2_electricity_main.csv")
data[, V1 := as.POSIXct(V1)]

There was a question recently about usage of fasttime::fastPOSIXct instead of as.POSIXct which can additionally speed up.

As for the title question, having POSIXct you can round it quite freely, e.g. functions year,month,mday...

data[, .SD, by = .(year(V1),month(V1),mday(V1))]
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • @jdepypere did you use other lower level rounding functions than listed *year,month,mday*? – jangorecki May 06 '15 at 12:14
  • Without any rounding functions (only your first code-block) the times are dropped when I print out an entire entry. – jdepypere May 06 '15 at 12:16
  • 1
    @jdepypere did you try `options("digits.secs")`? this may be related: http://stackoverflow.com/a/7730759/2490497 – jangorecki Jun 03 '15 at 12:30
  • I have the same problem, and reading the answers and comments it doesn't seem as the problem was resolved (the solutions didn't work for me in any case). Why did you mark @jangorecki answers as "accepted"? Did you resolve the problem? – Ratnanil Nov 24 '15 at 09:16
  • @Ratnanil did you try `as.POSIXct(as.character(.))` ? factor and the integers internally – jangorecki Nov 24 '15 at 09:22
  • @jangorecki yes I've tried that, but it didn't resolve the problem. The time values are dropped if I transform the whole set of 300'000 rows. If I subset 19704 or less, the time values are kept. Weird right? I'm working on a minimal example, which is proving to be hard since I can't seem to reproduce the error with synthetic data. I'll re-ask this question in a new post, if I cant find a work around. – Ratnanil Nov 26 '15 at 08:57
  • 2
    @jangorecki turns out that [Stibu's answer](http://stackoverflow.com/questions/30038701/r-as-posixct-dropping-hours-minutes-and-seconds/30039313#30039313) was the solution for me! value number 19705 was a Date Value without a "Time" component.. :-/ – Ratnanil Nov 26 '15 at 09:17