Extract monthly data from a table with mean greater than a threshold using R

Question

I have 2 tables (a and b) with 365 records each (1 year data). I want to get mean of each month in table a and if it falls below 0.01 then remove all daily values belonging to that month and output a new table. Also, I want corresponding daily values to be removed from table b as well to produce a new table for it.

For example: If January and April have monthly mean less than 0.01 then output table a and b with 304 values each. Outputs from dput(head(a)) and dput(head(b)) are respectively:

structure(list(V1 = c(0, 0, 0, 0.43, 0.24, 0)), .Names = "V1", row.names = c(NA, 6L), class = "data.frame")

structure(list(V1 = c(0.042022234, 0.014848409, 0.275174289, 0.485364883, 0.177960815, 0.006799459)), .Names = "V1", row.names = c(NA, 6L), class = "data.frame")

I don't know how to use list comprehension in R. Any suggestions would be appreciative.

You really need to try something and then show us what you've tried. 2 hints: `?"["` & `?as.POSIXlt` — Jesse Anderson, Nov 26 '14 at 01:32
Is a table a `data.frame`, or a `table`? Or is it a `list` like the tag suggests? Please make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — GSee, Nov 26 '14 at 01:40
edit your question to include the output of `dput(head(a))` and `dput(head(b))` — GSee, Nov 26 '14 at 01:42

jazzurro · Answer 1 · 2014-11-26T04:29:03.740

1

If table means data.frame and data structure is similar to @eclark's sample data, you could try something like this using dplyr.

DATA

set.seed(123)
a <- data.frame(Date=seq.Date(from = as.Date("2013-01-01"),to = as.Date("2013-12-31"),
                length.out = 365), value=rnorm(n = 365,mean = .01,sd = .1))

b <- data.frame(Date=seq.Date(from = as.Date("2013-01-01"),to = as.Date("2013-12-31"),
                length.out = 365), value=rnorm(n = 365,mean = .01,sd = .15))

CODE

library(dplyr)

# Create a column with month
mutate(a, month = as.character(format(Date, "%m"))) -> a
mutate(b, month = as.character(format(Date, "%m"))) -> b

# Get mean for each month and get months with average lower than 0.01 in the data frame, a
summarise(group_by(a, month), average = mean(value)) %>%
filter(average < 0.01) -> wutever

#wutever
#Source: local data frame [5 x 2]
#
#  month       average
#1    01  0.0068172630
#2    04  0.0006111069
#3    05 -0.0052247522
#4    08  0.0008155293
#5    12  0.0054872409

# Remove data points including months in wutever from a and b
filter(a, !month %in% wutever$month) -> newA
filter(b, !month %in% wutever$month) -> newB

edited Nov 26 '14 at 04:29

answered Nov 26 '14 at 02:46

jazzurro

23,179
35
66
76

I have single column with values and don't have date column as part of data frame – Ibe Nov 26 '14 at 02:50
@Ibe Where do you have date then? – jazzurro Nov 26 '14 at 02:50
@jazzuro: Data was downloaded by specifying start and end date from weather service. But table only contain values for that period – Ibe Nov 26 '14 at 02:52
@Ibe In that case, you need to create a sequence of date and create a data frame / data table including the dates and values you have. That will be presumably similar to what eclark created in his/her answer. – jazzurro Nov 26 '14 at 02:56

score 0 · Answer 2 · answered Nov 26 '14 at 01:36

Not really the most elegant or fast way but here is an idea:

a <- data.frame(Date=seq.Date(from = as.Date("2013-01-01"),to = as.Date("2013-12-31"),length.out = 365),a=rnorm(n = 365,mean = .01,sd = .1)) 
b <- data.frame(Date=seq.Date(from = as.Date("2013-01-01"),to = as.Date("2013-12-31"),length.out = 365),b=rnorm(n = 365,mean = .01,sd = .15)) 
require(dplyr)
c <- merge(a,b,by=1)
c <- tbl_df(data = c)
c <- mutate(c, month=substr(c$Date,6,7))
d <- summarise(group_by(c, month),am = mean(a),bm=mean(b))
c <- left_join(c,d)
c <- filter(c, c$am>=.01 & c$bm>= .01)
a <- c[,c(2,3)]
b <- c[,c(2,4)]
remove(c,d)

score 0 · Answer 3 · answered Nov 26 '14 at 01:38

0

Using base function alone, and assuming your two data frames have variables day, month, and value:

> new_a <- do.call(rbind, by(a, a$month, function(df) {
      ifelse(mean(df$value) < 0.01, NULL, df)
  }))
> new_b <- subset(b, day %in% new_a$day)

Or, you can use the plyr package and try:

> new_a <- ddply(a, .(month), function(df) ifelse(mean(df$value) < 0.01, NULL, df))
> new_b <- subset(b, day %in% new_a$day)

answered Nov 26 '14 at 01:38

mmuurr

1,310
1
11
21

I only have one variable in data frame that holds values. Is it possible to use only values instead of having `day` and `month` variables? – Ibe Nov 26 '14 at 02:05
I'm confused... how do you know which value corresponds to which day? You might need to clarify your question a bit. – mmuurr Nov 26 '14 at 03:42
I somehow was able to export date information into downloaded file but it is in following `19800101`. I need date format to convert it into only days, right? – Ibe Nov 26 '14 at 04:21

Extract monthly data from a table with mean greater than a threshold using R

3 Answers3