2

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.

I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.

Here are 2 example data frames:

# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain[ do.call(order, list(rain$date)),] 

# nitrate df 
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))

I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:

match(nitrate$date[i] %in% rain$date)

(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:

sum(rain$mm[j-6:j-1]

And prints the sum in a new column in nitrate

print(nitrate$mm_sum[i])

To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.

Many thanks in advance.

Community
  • 1
  • 1
BonnieM
  • 191
  • 1
  • 13
  • Do the dates when the rain and nitrates are measured match each other in value? – ArunK Nov 07 '16 at 15:04
  • What is the expected output for this example? `match(nitrate$ndate, rain$date) - 1` gives the indices for the last day of measurement to account for. Taking the subset of `cumsum(rain$mm)` for those indices and those indices minus 5 days and, then, subtracting the two vectors should give the sum of the last five days for each case. – alexis_laz Nov 07 '16 at 15:06
  • `@theArun`, the dates match each other, but for rainfall there is a measurement everyday and for nitrate it is about once a month, so they differ in length. – BonnieM Nov 08 '16 at 07:59
  • `@alexis_laz` your suggestion might also work, but see the answer below. – BonnieM Nov 08 '16 at 08:00

2 Answers2

1

You were more or less there!

nitrate$prev_five_rainfall = NA 
for (i in 1:length(nitrate$ndate)) { 
    day = nitrate$ndate[i] 
    nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-6):(day-1)])
}

Step by step explanation:

Initialize empty result column:

nitrate$prev_five_rainfall = NA 

For each line in the nitrate df: (i = 1,2,3,4,5)

for (i in 1:length(nitrate$ndate)) { 

Grab the day we want final result for:

    day = nitrate$ndate[i] 

Take the rainfull sum and it put in in the results column

    nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-6):(day-1)])

Close the for loop :)

}

Disclaimer: This answer is basic in that:

  • It will break if nitrate's ndate < 6
  • It will be incorrect if some dates are missing in the rain dataframe
  • It will be slow on larger data

As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.

Nelson Auner
  • 1,421
  • 1
  • 15
  • 21
  • thank you so much for the answer and the clear explanation! Much appreciated! I think I will actually need to do: `sum(rain$mm[(day-5):(day-1)])` instead of using `day-6` so that I get 5 days, not 6. My mistake! – BonnieM Nov 08 '16 at 07:55
1

@nelsonauner's answer does all the heavy lifting. But one thing to note, in my actual data my dates are not numerical like they are in the example above, they are dates listed as MM/DD/YYYY with the appropriate as.Date(nitrate$date, "%m/%d/%Y").

I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.

So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:

nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate)) 
nitrate$startdate <- as.Date(all$startdate, "%m/%d/%Y")

Then, calculate the difference using difftime():

nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))

Do the same for the rain data frame. Finally, the for loop looks like this:

nitrate$prev_five_rainfall = NA 
for (i in 1:length(nitrate$diffdays)) { 
    day = nitrate$diffdays[i] 
    nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])  # 5 days
}
BonnieM
  • 191
  • 1
  • 13
  • I've found that the function only works if `nitrate$diffdays` starts at 1. Otherwise the function is unable to figure out what row `day` refers to in the rain data frame. So you need to add a line using `pmatch()`. The full function should read `nitrate$prev_five_rainfall = NA for (i in 1:length(nitrate$diffdays)) { day = nitrate$diffdays[i] rainday = pmatch(day, rain$diffdays, dup=FALSE) nitrate$prev_five_rainfall[i] = sum(rain$mm[(rainday-5):(rainday-1)]) # 5 days }` – BonnieM Nov 18 '16 at 11:34