0

What I have: A large file of 1451 rows and 4663 columns with values 0, 1 and NAs. The row names represent sites and the columns names the date of the survey. I would like to sum rows using specific date intervals, that is to sum specific columns referring to the columns name, which represent dates. In this case I have 666 different date intervals through which to sum rows. The specific intervals are in an object type character.

A short reproducible example would be:

    df <- read.table(text ="     2005-09-23  2005-09-24  2005-09-25  2005-09-26    2005-09-27  2005-09-28  2005-09-29  2005-09-30  2005-10-07  2005-10-08
1  0       0     1    NA    1       0     1    NA   1   0          
2  1       1     1    1     1       1     1    1    1   1    
3  NA      NA    NA   NA    NA      NA    NA   NA   NA  NA", header = TRUE)

And a character string with the column names to sum, that is the date intervals, with start and end dates separated by "_":

intervals <- c("2005-09-23_2005-09-26", "2005-09-27_2005-10-30", "2005-10-07_2005-10-08")

What I want: A data frame with the column names that reflect the date intervals, the sites in the rows, and in the cells, the total of the summed values. It is important to have NAs in the sums in which all values where NAs.

result <- read.table(text ="     2005-09-23_2005-09-26  2005-09-27_2005-10-30  2005-10-07_2005-10-08
1  1       2     1           
2  4       4     2     
3  NA      NA    NA", header = TRUE)

My problem: I don't know how to specify a set of column intervals using the column names. I have found useful information related to my problem here but they all require to specify manually the columns over to which to sum, e.g. df_sum <- rowSums(df [,c(1:3)]) which in my case would be 666 date intervals. I think the answer is somewhere along the lines of the following posts and using the rowSums command, however I can't figure out how to adapt those problems to mine. How to get rowSums for selected columns in R and Sum rows in data.frame or matrix and Sum rows by interval Dataframe and R - How to sum objects in a column between an interval defined by conditions on another column

I am not too R savvy, could someone be so kind to provide me with a code to do this? Thank you!

AnnK
  • 189
  • 1
  • 10
  • For the second column, i get 3, 6, NA as output. Can you please check if there is any typo in your 'result' – akrun Feb 21 '20 at 17:52

1 Answers1

0

We convert the column names of the 'df' to Date class, then split the 'intervals' vector at _, into a list of vectors, use comparison operators with 'v1' to create a logical vector, subset the columns of 'df' based on that, and get the rowSums, finally, cbind the list elements

v1 <-  as.Date(names(df), "X%Y.%m.%d")
out <- do.call(cbind.data.frame, lapply(strsplit(intervals, "_"), function(x){
     tmp <- df[v1 >= x[1] & v1 <= x[2]]
     NA^(!rowSums(!is.na(tmp)))* rowSums(tmp, na.rm = TRUE)
  }))
colnames(out) <- intervals
out
#  2005-09-23_2005-09-26 2005-09-27_2005-10-30 2005-10-07_2005-10-08
#1                     1                     3                     1
#2                     4                     6                     2
#3                    NA                    NA                    NA
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi Akrun. There isn't any typo in the second column of my result. The rest of the columns are good, I don't know why the second column is not matching. – AnnK Feb 26 '20 at 10:11
  • One thing to note: the date periods that are summed for the second column do not represent all the consecutive days within the time period (i.e. the time period is from Sep 27 to Oct 30 but there is only data from 3 days within this period). Could this be the problem for the incorrect result for this time period sum? – AnnK Feb 26 '20 at 10:52