4

I've two dataframes, Data and quantiles. Data has a dimension of 23011 x 2 and consists of columns "year" and "data" where year are the sequence of days from 1951:2013. The Quantiles df has a dimension of 63x2 consists of columns "year" and "quantiles" , where year are 63 rows, ie. 1951:2013.

I need to compare Quantile df against the Data df and count the sum of data values exceeding the quantiles value for each year. For that, I'm using ddply in this manner :

ddply(data, .(year), function(y) sum(y[which(y[,2] > quantile[,2]),2]) )

However, the code compares only against the first row of quantile and is not iterating over each of the year against the data df. I want to iterate over each year in quantile df and calculate the sum of data exceeding the quantile df in each year.

Any help shall be greatly appreciated.

The example problem - quantile df is here and Data is pasted here

The quantile df is derived from the data , which is the 90th percentile data df exceeding value 1

quantile = quantile(data[-c(which(prcp2[,2] < 1)),x],0.9)})
comquest
  • 61
  • 6
  • Hi, please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). And is the 'quantiles'-data derived from the original data, or from an external source? – Heroka Sep 21 '15 at 06:41

2 Answers2

2

In addition to the Heroka answer above, If you have 10,000 columns and need to iterate over each of the column, you can use matrix notation in this form -

lapply(x, function(y) {ddply(data,.(year), function(x){  return(sum(x[x[,y] > quantile(x[x[,y]>1,y],0.9),y]))})})

where x is the size of columns, ie, 1:1000 and data is the df which contains the data. The quantile(x[x[,y]>1,y],0.9),y]) will give the 90th percentile for data values exceeding 1 . x[x[,y] > quantile(x[x[,y]>1,y],0.9),y] returns the rows which satisfies the condition for the yth column and sum function is used to calculate the sum.

comquest
  • 61
  • 6
1

Why not do this in one go? Creating the quantiles-dataframe first and then referring back to it makes things more complicated than they need to be. You can do this with ddply too.

set.seed(1)
data <- data.frame(
  year=sample(1951:2013,23011,replace=T),
  data=rnorm(23011)
)


res <- ddply(data,.(year), function(x){
  return(sum(x$data[x$data>quantile(x$data,.9)]))
})

And -as plyr seems to be replaced with dplyr - :

library(dplyr)


res2 <- mydf %>% group_by(year) %>% summarise(
  test=sum(value[value>quantile(value,.9)])
)
Heroka
  • 12,889
  • 1
  • 28
  • 38
  • Thanks Heroka for the answer. It worked . However you missed one condition where `quantile` df should be greater than 1. Though, I put the condition and it gave the expected answer. The detailed answer will be posted to help other guys. Thanks again. – comquest Sep 22 '15 at 08:48