How can I filter out data and summarize from multiple columns?

Question

I have a dataset (precipitation) with four columns. I want to summarize (in table format) the amount of rain that occurred on a monthly basis for each month in 2019 and 2020 (sum and difference between two years). I am struggling how to summarize heaps of daily data to give me a monthly summary AND filtering it out for quality that is "Good".

Columns in Dataset:

colnames(rain_file)

"ID" "deviceID" "remarks" "date" "amount_rain" "quality"

Date (The date column is formatted as follows and there are multiple readings for each date)

head(rain_file$date)

2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01 1096 Levels: 2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05 ... 2020-12-31

Quality (5 types of Quality, I only want to filter for "Good")

head(rain_file$quality)

Good Good Good Good Good Good... Levels: Absent Good Lost Poor Snow Trace

I have this so far but it's not correct and I'm not sure what to do next...

data=read.table("rain_file.csv", header=TRUE, sep=",", fill=T, quote="\"")

dates=apply(data,1, function(x) {strsplit(x["date"],"-")})

data=cbind(data, t(as.data.frame(dates, row.names=c("year", "month", "day"))))
m_rain_df=tapply(data$amount_rain, data[,c("year","month")], mean, na.rm=T)
data=data.table(data)           
m_rain_dt=data[, list(month_rain=mean(amount_rain, na.rm=T)), by=list(year, month)]

Please consider providing us with a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that we can better help you. — lovalery, Oct 15 '21 at 13:28

score 0 · Answer 1 · answered Oct 15 '21 at 07:20

Here a solution using dplyr:

library(tydiverse)

## create a dummy dataset
dat <-
  tibble(
    date = factor(c('2018-01-01', '2018-01-02', '2018-01-03', '2018-02-01', '2018-02-02', '2018-02-03')),
    quality = factor(c('Absent',    'Good',       'Good',       'Snow',       'Good',       'Good')),
    amount_rain = runif(6)
  )

dat %>%
  ## split date column in year month day
  mutate(date = as.character(date)) %>%
  separate(date, c("year", "month", "day"), sep = '-') %>%
  ## keep only good quality data
  filter(quality == 'Good') %>%
  ## summatize by year and month
  group_by(year, month) %>%
  summarise(
    mean_amount_rain = mean(amount_rain)
  )

Which gives:

# A tibble: 2 × 3
# Groups:   year [1]
  year  month mean_amount_rain
  <chr> <chr>            <dbl>
1 2018  01               0.729
2 2018  02               0.466

How can I filter out data and summarize from multiple columns?

1 Answers1