
ISSUE ---------

I have thousands of time series files (.csv), each containing intermittent data spanning between 20 and 50 years (see df below). Each file contains a date_time column and a metric (temperature). The data is hourly, and where no measurement exists the value is 'NA'.

>df
date_time         temp 
01/05/1943 11:00  5.2
01/05/1943 12:00  5.2
01/05/1943 13:00  5.8
01/05/1943 14:00   NA
01/05/1943 15:00   NA
01/05/1943 16:00  5.8
01/05/1943 17:00  5.8
01/05/1943 18:00  6.3
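
A file in this layout can be read in like so (a minimal sketch; the file name df.csv is illustrative, and the date format is assumed from the sample above):

# read one file, treating "NA" as missing, and parse the timestamps
df <- read.csv("df.csv", na.strings = "NA", stringsAsFactors = FALSE)
df$date_time <- as.POSIXct(df$date_time, format = "%d/%m/%Y %H:%M", tz = "UTC")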

I need to check these files to see whether they have sufficient data density, i.e. that the ratio of NAs to data values is not too high. To do this I have 3 criteria that must be checked for each file:

  1. Ensure that no more than 10% of the hours in a day are NAs.
  2. Ensure that no more than 10% of the days in a month are NAs (a day counts as NA if it fails criterion 1).
  3. Ensure that there are 3 continuous years of data with valid days and months.

The criteria must be checked sequentially, and if a file does not meet them I need to add it to a data frame (or any list) of the files that fail.
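
One possible calendar-based implementation of the three checks, sketched with dplyr and lubridate (the function name fails_density and the directory "data" are illustrative, and it assumes each file has a complete hourly grid with NA placeholders, as in the sample):

library(dplyr)
library(lubridate)

fails_density <- function(df) {
  # Criterion 1: mark a day bad if more than 10% of its hours are NA
  daily <- df %>%
    mutate(day = as_date(date_time)) %>%
    group_by(day) %>%
    summarise(bad_day = mean(is.na(temp)) > 0.10, .groups = "drop")

  # Criterion 2: mark a month bad if more than 10% of its days are bad
  monthly <- daily %>%
    mutate(month = floor_date(day, "month")) %>%
    group_by(month) %>%
    summarise(bad_month = mean(bad_day) > 0.10, .groups = "drop")

  # Criterion 3: look for 3 consecutive calendar years with no bad months
  yearly <- monthly %>%
    mutate(yr = year(month)) %>%
    group_by(yr) %>%
    summarise(clean = !any(bad_month), .groups = "drop")

  clean_years <- sort(yearly$yr[yearly$clean])
  has_3_consecutive <- length(clean_years) >= 3 &&
    any(diff(clean_years, lag = 2) == 2)   # y, y+1, y+2 all clean
  !has_3_consecutive                       # TRUE means the file fails
}

# Collect the failing files into a data frame (the blacklist)
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
bad <- Filter(function(f) {
  df <- read.csv(f, na.strings = "NA", stringsAsFactors = FALSE)
  df$date_time <- as.POSIXct(df$date_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
  fails_density(df)
}, files)
blacklist <- data.frame(file = bad)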

QUESTION --------

I wanted to ask the community how to go about this. I have considered nested if statements inside loops, along with sqldf, plyr, aggregate, or even dplyr, but I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.

  • Maybe you had better give us the original file content rather than the data.frame you have read from it? And while you are at it, how much data is in each file? Are data for one year possibly in two or more different files? – vaettchen Mar 31 '14 at 02:19
  • The data for one year and for a given location are always in the same file. Thanks for responding. See an example file here: (https://www.dropbox.com/s/tiy62evbc4hl4wn/df.csv) – Coper Mar 31 '14 at 15:50
  • Is my understanding correct: (1) check the days and mark all days that have more than 10% NA, then (2) check the months and mark all months that have 3 or more days marked in the previous step, then (3) check the years and make sure that there are at least three continuous years where no days or months have been marked in the previous steps. A file that does not have these minimum three years goes on the blacklist you want to produce. – vaettchen Apr 01 '14 at 22:25
  • Your summary is correct. – Coper Apr 02 '14 at 01:52
  • Do you still need this? I'm working on it but have little time; I'll come up with an answer over the weekend. – vaettchen Apr 03 '14 at 22:05
  • I always welcome alternate solutions to a problem for myself and others but please prioritise your time over a response. – Coper Apr 04 '14 at 20:41
  • Since you already have a solution, my motivation faded ;-) What I tried was moving everything into an SQLite database and doing much of the work with RSQLite. With this amount of data, there is large potential for speeding things up. – vaettchen Apr 07 '14 at 05:11

1 Answer


I think this will work for you. These functions check every hour for NAs over the following day, month, or 3-year window, and return the number of NAs in each such window. So for checkdays, if it returns a value greater than 2.4 (10% of 24 hours), then by your 10% rule you'd have a problem; for months the threshold is 72 (10% of 720 hours), and for 3-year periods you're hoping for values below 2628 (10% of 26,280 hours). I haven't tested these against made-up data, so please double-check them. By the way, the functions assume your NA data is in column 2. Cheers.

checkdays <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 23)) {
    nadata <- data[i:(i + 23), 2]      # 24-hour window starting at hour i
    countNA[i] <- sum(is.na(nadata))   # number of NAs in the window
  }
  return(countNA)
}

checkmonth <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 719)) {
    nadata <- data[i:(i + 719), 2]     # 720-hour (30-day) window
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}

check3years <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 26279)) {
    nadata <- data[i:(i + 26279), 2]   # 26,280-hour (3-year) window
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}
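
To turn these window counts into a per-file flag, one illustrative way to combine them with the thresholds above (assuming df is a data frame like the one in the question):

day_counts   <- checkdays(df)
month_counts <- checkmonth(df)
yr3_counts   <- check3years(df)

# Criteria 1 and 2: no window may exceed 10% NAs;
# criterion 3: at least one 3-year window must stay at or below 10% NAs
file_ok <- all(day_counts <= 2.4) &&
           all(month_counts <= 72) &&
           any(yr3_counts <= 2628)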

So I ended up testing these, and they work for me. Here are system times for a year-long dataset, so I don't think you'll have problems.

> system.time(checkdays(RM_W1))
   user  system elapsed 
   0.38    0.00    0.37 
> system.time(checkmonth(RM_W1))
   user  system elapsed 
   0.62    0.00    0.62

Optimization: I took the time to run these functions with the data you posted above, and it wasn't good. For loops work well for small data sets but can slow down dramatically as data sets grow if they're not constructed properly (growing a result vector inside the loop forces R to copy it on every iteration). I cannot report system times for the functions above with your data (they never finished; I gave up after about 30 minutes). After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the work done inside the loop and pre-allocating memory you can really speed things up. Note that you now need to call the function as checkdays(df[,2]), passing the column as a vector, but it's faster this way.

checkdays <- function(data) {
  countNA <- numeric(length(data) - 23)  # pre-allocate the result vector
  for (i in 1:(length(data) - 23)) {
    nadata <- data[i:(i + 23)]           # data is now a vector, not a data frame
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
   user  system elapsed 
   4.41    0.00    4.41 
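
If even that is too slow, the same rolling NA count can be computed with no explicit loop at all, using a cumulative sum (a sketch; checkdays_vec is an illustrative name, and window = 720 or 26280 reproduces the month and 3-year variants):

checkdays_vec <- function(x, window = 24) {
  cs <- cumsum(c(0L, is.na(x)))  # running total of NAs, with a leading zero
  # NA count for each window of `window` hours starting at position i
  cs[(window + 1):length(cs)] - cs[1:(length(cs) - window)]
}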

I believe this should be sufficient for your needs. Regarding leap years, you should be able to modify the optimized function as mentioned in the comments below; just make sure you pass the leap-year indicator as a second vector argument rather than as a second column, since the optimized function takes a vector.
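
For illustration, that adjustment might look like this on the optimized 3-year function (check3years_leap and the 0/1 vector is_leap are hypothetical, following the comments below; windows that extend past the end of the record would still need guarding):

check3years_leap <- function(data, is_leap) {
  n <- length(data) - 26279
  countNA <- numeric(n)                  # pre-allocated result
  for (i in 1:n) {
    # extend the window by 24 hours when it starts in a leap year
    nadata <- data[i:(i + 26279 + is_leap[i] * 24)]
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}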

  • This might take a little while to run so if you don't want to check every hour and instead want to check every day then change `for(i in 1:(length(data)-23))` to `for(i in 1:(length(data)/24-1))` and adjust `nadata=data[i:(i+23),2]` to `nadata=data[(i*24-23):(i*24),2]`. – CCurtis Mar 31 '14 at 04:44
  • Thank you so much for your help. This is not an approach that I had considered and will likely save me time. As always additional suggestions are welcome. (I have also been wondering how to deal with leap years in this issue). – Coper Mar 31 '14 at 16:11
  • Because you're calculating every hour and not every year, leap years are a little irrelevant, I think. What I mean is that after the first day of a leap year you would be grabbing NAs an extra day into the next year. Would you do the same thing for months? They're different lengths too. Really it just depends on how you want to do things; I'm not going to say there is a right answer here. If you want to sample an extra day for NAs whenever you're in a leap year, you could simply add another column specifying whether it is a leap year or not: 1 for a leap year, 0 for not. – CCurtis Mar 31 '14 at 19:05
  • Then you can change `nadata=data[i:(i+26279),2]` to `nadata=data[i:(i+26279+data[i,3]*24),2]`. This should work as long as you don't end on a leap year. If you do, just specify the last day of the leap year as a normal year, i.e. for that last day make sure column three is 0 and not 1. – CCurtis Mar 31 '14 at 19:37
  • You're welcome. By the way, I forgot to mention that I ran the 3-year function and it is much slower; it takes about 20 minutes on my machine. This is just because it's indexing all the NAs over a 3-year period rather than a day. For the 3-year function I would suggest sampling NAs every month rather than every hour as in the daily function; for the monthly function a daily sampling interval is probably appropriate. Good question and best of luck with your calculations; sounds like a real bear given that you have thousands of these files. – CCurtis Apr 01 '14 at 01:12