I have a data set with the following columns:
locID = the location of ID of the observer
yr = the year of the observation in categorical format: P_year
maxFlock = a number counted by the observer
lat = latitude of the location
long = longitude of the location
state = US state of the observation
effortDays = categorical, I, II, III, and IV
effortHours = categorical, A, B, C, D
Here is a sample of the data frame:
PData
locID yr maxFlock lat long state effortDays effortHours
L4278 P_2000 3 41.42 -73.67 NY II C
L4278 P_2000 6 41.42 -73.67 NY III C
L4278 P_2000 4 41.42 -73.67 NY III C
L4278 P_2012 2 41.42 -73.67 NY III B
L4278 P_2012 4 41.42 -73.67 NY IV B
L4278 P_2012 8 41.42 -73.67 NY IV B
L10494 P_2003 4 42.01 -77.44 NY IV C
L10494 P_2003 0 42.01 -77.44 NY IV C
L10494 P_2003 8 42.01 -77.44 NY IV D
L10494 P_2005 4 42.01 -77.44 NY IV C
L10494 P_2005 6 42.01 -77.44 NY IV C
L10494 P_2009 8 42.01 -77.44 NY IV C
I want to make a new column (labelled: xmf) that computes the average of maxFlock. The average, though, has to be computed for each unique combination of locID, yr, effortDays, and effortHours . If I were to run the code on the above sample, the final product would look like this.
PData
locID yr maxFlock xmf lat long state effortDays effortHours
L4278 P_2000 3 3 41.42 -73.67 NY II C
L4278 P_2000 6 5 41.42 -73.67 NY III C
L4278 P_2000 4 5 41.42 -73.67 NY III C
L4278 P_2012 2 2 41.42 -73.67 NY III B
L4278 P_2012 4 6 41.42 -73.67 NY IV B
L4278 P_2012 8 6 41.42 -73.67 NY IV B
L10494 P_2003 4 2 42.01 -77.44 NY IV C
L10494 P_2003 0 2 42.01 -77.44 NY IV C
L10494 P_2003 8 8 42.01 -77.44 NY IV D
L10494 P_2005 4 5 42.01 -77.44 NY IV C
L10494 P_2005 6 5 42.01 -77.44 NY IV C
L10494 P_2009 8 8 42.01 -77.44 NY IV C
I originally tried to do this using:
PData$xmf = ave(myData2$maxFlock, myData2$locID, myData2$yr, myData2$effortDays, myData2$effortHours)
But it didn't work (had to kill it after waiting over a half hour), and I'm also not even sure if ave() can do what I want it to do.
I was thinking about trying something with the split-apply-combine methodology, but I don't think that's exactly what I'm looking for, because I would have to subset for locID, then subset for year, then for effortHours OR effortDays, and I don't want to have to make that choice. I want to do it by unique combinations.
It would also be great if there were a fast way to do this. The data I'm working with is about 2.5 million rows, so if-statements inside of for-loops are definitely not ideal.
Thank you!