I'm struggling with some R code that I'm sure could be written using one of the apply family of functions, but I can't work out how.
I have a data frame listing many sites at which I have measurements. It holds various pieces of metadata (including the site name), as well as aggregate statistics for each site. I need to select many different groups of sites using the values in the metadata, then get hold of the original raw data (that is, every single observation from each of the sites) and calculate statistics on that.
The selection criteria for these groups are quite complex, and I'm basically doing every possible combination of various different subsets, so I thought it was best to do this via the intersection of indices. So, my code looks like:
# Calculate indices for each of the selection criteria
sets <- list(All = seq_len(nrow(ClassifiedValAERONET)),
             UK = which(ClassifiedValAERONET$UK == 1))
cat_excluded <- list(None = seq_len(nrow(ClassifiedValAERONET)),
                     Separated = which(ClassifiedValAERONET$Category1_SmallIsland == 0 &
                                       ClassifiedValAERONET$Category2_SeparatedLandMass == 0))
# Loop over all combinations of the categories above,
# intersect and then calculate the statistics
for (i in seq_along(sets)) {
  for (j in seq_along(cat_excluded)) {
    # Rows that satisfy both the i-th set and the j-th exclusion rule
    ind <- intersect(sets[[i]], cat_excluded[[j]])
    print(get_stats(ind))
    print("-------------------------------------")
  }
}
In this code, I put the indices that match various conditions into lists, then run a nested for loop over the two lists to get all combinations. For each combination I intersect the indices to get the rows that match both conditions and pass these to a function, which extracts all of the original data for the stations with those indices and calculates the statistics. The result of this function is a list of statistics:
List of 8
$ rmse : num 1.5
$ err_mean : num 0.631
$ err_sd : num 1.37
$ perc_err_mean: num 3.79
$ perc_err_sd : num 10.1
$ m : num 0.949
$ R2 : num 0.993
$ n : int 9163
I then want to put the results from every iteration of the nested loop into one data frame, with the statistics (rmse, err_mean, etc.) as the columns and each combination of conditions (sets$All, cat_excluded$None, etc.) as the rows. Of course, I also need to add extra columns to this data frame to say exactly which conditions were used for each row.
I'm pretty sure the way I'm doing this isn't the best way, but I'm not sure how to do it in a 'proper R way'. I deliberately put my statistical calculations in a function so that I could use it with apply (or similar), but I can't see what I can apply it over. If I could apply over a data frame with all of the combinations of the categories already in it (see sketch below) then that would be a good start, but I've no idea how to create one of those.
+-----+-----------+
| Set | Excluded |
+-----+-----------+
| All | None |
| All | Separated |
| UK | None |
| UK | Separated |
+-----+-----------+
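One way a table like this might be built is with base R's expand.grid, which returns a data frame containing every combination of the vectors it is given. This is only a minimal sketch: it assumes the two lists are named as in the loop above, and the index values are invented stand-ins for the real ones.

```r
# Hypothetical stand-ins for the real index lists (values are invented)
sets <- list(All = 1:6, UK = c(2, 4, 5))
cat_excluded <- list(None = 1:6, Separated = c(1, 2, 5, 6))

# expand.grid varies its first argument fastest, so Excluded is passed first
# and the columns are then put back into the order used in the sketch table
combos <- expand.grid(Excluded = names(cat_excluded),
                      Set = names(sets),
                      stringsAsFactors = FALSE)[, c("Set", "Excluded")]

print(combos)
```

With five criteria instead of two, the same call would simply take five name vectors, one per list of conditions.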
The ideal final result would be something like:
+-----+-----------+------+--------+-------+
| Set | Excluded | RMSE | Perc_E | Max_E |
+-----+-----------+------+--------+-------+
| All | None | 2.53 | 0.65 | 34.5 |
| All | Separated | 1.87 | 0.54 | 9.87 |
| UK | None | 4.53 | 0.1 | 3.62 |
| UK | Separated | 1.23 | 0.87 | 6.78 |
+-----+-----------+------+--------+-------+
(Although in real life there would be five columns for the various criteria and about ten columns for the statistics.)
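A result of that shape might be assembled along the following lines. This is a sketch under stated assumptions, not a definitive solution: the index values are invented, and the get_stats defined here is a toy stand-in that returns a named list in the same way the real function does.

```r
# Hypothetical stand-ins: invented indices and a toy get_stats() that returns
# a named list (the real function extracts the raw data for those sites)
sets <- list(All = 1:6, UK = c(2, 4, 5))
cat_excluded <- list(None = 1:6, Separated = c(1, 2, 5, 6))
get_stats <- function(ind) list(n = length(ind), mean_index = mean(ind))

# Every combination of condition names, one per row
combos <- expand.grid(Excluded = names(cat_excluded),
                      Set = names(sets),
                      stringsAsFactors = FALSE)[, c("Set", "Excluded")]

# For each row: intersect the two index vectors, compute the statistics,
# and convert the returned list into a one-row data frame
rows <- lapply(seq_len(nrow(combos)), function(k) {
  ind <- intersect(sets[[combos$Set[k]]], cat_excluded[[combos$Excluded[k]]])
  as.data.frame(get_stats(ind))
})

# Stack the one-row data frames and attach the condition labels as columns
results <- cbind(combos, do.call(rbind, rows))
print(results)
```

Keeping the condition names in their own data frame means the label columns come for free: each row of statistics is bound next to the row of names that produced it.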
I hope this has made some sort of sense - any advice would be greatly appreciated.