
As I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm now taking my very first steps with R and struggling a bit. The problem is this: I have a second-by-second CSV file with half a year of data that looks like this:

metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06

As you can see, some seconds are missing every once in a while (don't ask me why the values are written before the timestamps, but that's how I received the data…). Now I'm trying to calculate the number of values (= seconds) that are missing.

So my idea was

  1. to create a vector that is correct (includes all sec-by-sec timestamps),
  2. match the given CSV file with that new vector, and
  3. sum up all the timestamps with no value.

I managed to make step 1 happen with the following code:

RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")  
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")

To get an idea of what I did, I also exported the vector to a CSV, which looks like this:

"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06

Unfortunately I have no idea how to go on with steps 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate them to my sec-by-sec data.

Some hints for the greenhorn would be very very helpful – thank you very much in advance :)

  • Well, I would say that rather than using your method, just subtract each timestamp from the one before it. If the difference is greater than 1, output the value; otherwise leave it. Will share code in some time. – Dinesh.hmn Jul 26 '16 at 17:47
  • Thank you! Yes, subtracting the number of lines of data I have from the "should be" number of seconds is the most obvious solution (so obvious that I hadn't had the idea, to be honest). But I had some further analysis in mind anyway, like finding out where the longest data blackouts occurred, how long they were, and so on. To do that I need a data set with "NA" values anyway. But for the first steps, your solution is as brilliant as it is simple :) – Bene Jul 26 '16 at 19:09
  • I have given my code below as an answer, Bene. Check and let me know if it helps – Dinesh.hmn Jul 27 '16 at 12:24
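For what it's worth, a minimal sketch of the diff-based idea from the comment above, assuming `df$timestamp` has already been parsed to POSIXct and sorted ascending:

# seconds between consecutive rows; a gap of g seconds hides g - 1 values
gaps <- diff(as.numeric(df$timestamp))
sum(gaps[gaps > 1] - 1)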

4 Answers


In the tidyverse,

library(dplyr)
library(tidyr)

df %>% 
    # parse the datetimes
    mutate(timestamp = as.POSIXct(timestamp)) %>% 
    # complete to the full second-by-second sequence from min to max
    complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))

## # A tibble: 7 x 2
##             timestamp metering
##                <time>    <int>
## 1 2016-01-01 00:00:00      123
## 2 2016-01-01 00:00:01      345
## 3 2016-01-01 00:00:02      243
## 4 2016-01-01 00:00:03       NA
## 5 2016-01-01 00:00:04      101
## 6 2016-01-01 00:00:05       NA
## 7 2016-01-01 00:00:06      134

If you want the number of NAs (i.e. the number of seconds with no data), add on

%>% tally(is.na(metering))

## # A tibble: 1 x 1
##       n
##   <int>
## 1     2
alistaire
  • Thank you very much, this looks like what I am searching for! Unfortunately it doesn't produce any result – do I have to add some lines somewhere? I guess "df" is the vector with my original data, which I can read in with `df <- read.csv(fpath, header=TRUE, sep=",")`, right? – Bene Jul 26 '16 at 19:06
  • You can just use `df <- read.csv(fpath)`; the other parameters are already defaults. To save the results, assign them to something, e.g. `df2 <- df %>% mutate(....` – alistaire Jul 26 '16 at 19:12
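Putting that comment thread together, a minimal end-to-end sketch (assuming a hypothetical file name `data.csv` holding the question's two columns):

library(dplyr)
library(tidyr)

# "data.csv" is a placeholder for the real file path
df <- read.csv("data.csv", stringsAsFactors = FALSE)

df2 <- df %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC")) %>%
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "sec"))

# number of seconds with no data
df2 %>% tally(is.na(metering))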

You can check which values of your RegularTimeSeries are in your broken time series using `which` and `%in%`. First, create BrokenTimeSeries from your example:

RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds

This will give you the indices of the values within RegularTimeSeries that are not in BrokenTimeSeries:

> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9

This will return the actual values:

> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"

Maybe I'm misunderstanding your problem, but you can count the number of missing seconds simply by subtracting the length of your broken time series from that of RegularTimeSeries, or by taking the length of either of the two resulting vectors above.

> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3

If you want to merge the two series to see the missing values, you can do something like this:

# data frame with the regular time series and a "step"
df <- data.frame(RegularTimeSeries)

# fill the first length(BrokenTimeSeries) regular values into the rows that
# have a counterpart in the broken series; the remaining rows stay NA
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <-
  as.numeric(df$RegularTimeSeries[seq_along(BrokenTimeSeries)])
# the subassignment above creates a plain numeric column, so restore the
# POSIXct class (seconds since the 1970-01-01 epoch)
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin = "1970-01-01", tz = "UTC")

resulting in:

> df[1:12,]
     RegularTimeSeries    BrokenTimeSeries
1  2016-01-01 00:00:00 2016-01-01 00:00:00
2  2016-01-01 00:00:01 2016-01-01 00:00:01
3  2016-01-01 00:00:02                <NA>
4  2016-01-01 00:00:03 2016-01-01 00:00:02
5  2016-01-01 00:00:04 2016-01-01 00:00:03
6  2016-01-01 00:00:05                <NA>
7  2016-01-01 00:00:06 2016-01-01 00:00:04
8  2016-01-01 00:00:07 2016-01-01 00:00:05
9  2016-01-01 00:00:08                <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:06
11 2016-01-01 00:00:10 2016-01-01 00:00:07
12 2016-01-01 00:00:11 2016-01-01 00:00:08
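A possible variant, not from the answer above: `match` aligns each broken value with its regular position, so the missing seconds show up as NA in place rather than as a step:

# NA wherever a regular second has no counterpart in the broken series
df$Aligned <- BrokenTimeSeries[match(RegularTimeSeries, BrokenTimeSeries)]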
Warner

If all you want is the number of missing seconds, it can be done much more simply: first find the number of seconds in your time range, then subtract the number of rows in your dataset. In R this could be done along these lines:

n.seconds <- as.numeric(difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units = "secs"))
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows

Adjust the time range and the data-frame name to match your data.

homer
  • If you use `min` and `max` to calculate the times to subtract (and maybe account for existing `NA`s), this is a nice solution. – alistaire Jul 26 '16 at 17:55
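A hedged sketch of that suggestion, assuming `my.data.frame` holds the question's parsed `timestamp` and `metering` columns:

# derive the range from the data itself; + 1 because both endpoints count
n.seconds <- as.numeric(difftime(max(my.data.frame$timestamp),
                                 min(my.data.frame$timestamp), units = "secs")) + 1
# count only rows that actually carry a value, accounting for existing NAs
n.missing.values <- n.seconds - sum(!is.na(my.data.frame$metering))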

Hope it helps

d <- as.POSIXct(c("2016-01-01 00:00:01",
                  "2016-01-01 00:00:02",
                  "2016-01-01 00:00:03",
                  "2016-01-01 00:00:04",
                  "2016-01-01 00:00:05",
                  "2016-01-01 00:00:06",
                  "2016-01-01 00:00:10",
                  "2016-01-01 00:00:12",
                  "2016-01-01 00:00:14",
                  "2016-01-01 00:00:16",
                  "2016-01-01 00:00:18",
                  "2016-01-01 00:00:20",
                  "2016-01-01 00:00:22"))

# keep each timestamp that follows a gap of more than one second;
# all other positions stay NA
gaps <- rep(NA_real_, length(d))
for (i in 2:length(d)) {
  if (difftime(d[i], d[i - 1], units = "secs") > 1) {
    gaps[i] <- as.numeric(d[i])
  }
}
class(gaps) <- c("POSIXct", "POSIXt")
gaps
 [1] NA                        NA                        NA                        NA                        NA                       
 [6] NA                        "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12 EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01 00:00:22 EST"
Dinesh.hmn