Loop over unique values R

Question

I previously posted a loop question and am trying another loop with no success. Help with trying to figure this out would be greatly appreciated. As of now, to get my work done I'm going to subset the data by year and run my original function as is, but one of the datasets I'm working with is a long time series. My original function calculates the number of fish at age for a given year dataset. This function works fine. What I would like to do is add a for loop that will allow the function to loop for all years and give the same information.

Data:

x <- structure(list(Year = c(2007, 2012, 2012, 2007, 2012, 2007, 2012, 
2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 
2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 
2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 
2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 
2007, 2012, 2012, 2007, 2012, 2007, 2012, 2012, 2007, 2012, 2012, 
2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2012, 2007, 2012, 
2012, 2012, 2012, 2007, 2012, 2007, 2012, 2007, 2012, 2007, 2012
), Season = c("Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", 
"Fall", "Fall", "Fall", "Fall", "Fall", "Fall"), Length = c(6, 
9, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 
18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 
26, 26, 27, 27, 28, 28, 29, 29, 30, 30, 31, 31, 32, 32, 33, 33, 
34, 34, 35, 35, 36, 37, 37, 38, 38, 39, 40, 40, 41, 42, 42, 43, 
43, 44, 44, 45, 45, 46, 47, 47, 48, 49, 50, 51, 51, 52, 52, 53, 
54, 55, 58), Exp_number = c(2, 1, 3, 2, 2, 6, 4, 11, 6, 24, 13, 
38, 41.208, 26, 77.096, 37, 227.704, 41, 276.064, 20, 276.536, 
23, 277.008, 23, 72.832, 11, 66.096, 8, 43.888, 12, 13.472, 14, 
2, 14, 4, 8, 4, 10, 5, 12, 2, 13, 5, 9, 2, 7, 1, 4, 3, 2, 2, 
8, 2, 3, 2, 1, 3, 2, 5, 1, 8, 2, 2, 2, 1, 6, 1, 2, 1, 1, 4, 1, 
3, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1)), .Names = c("Year", "Season", 
"Length", "Exp_number"), row.names = c(8L, 43L, 55L, 64L, 68L, 
75L, 78L, 86L, 91L, 98L, 103L, 110L, 115L, 120L, 125L, 131L, 
136L, 143L, 148L, 157L, 162L, 169L, 174L, 181L, 186L, 193L, 197L, 
206L, 211L, 220L, 225L, 234L, 238L, 247L, 252L, 260L, 265L, 274L, 
279L, 288L, 293L, 302L, 307L, 316L, 320L, 329L, 334L, 343L, 346L, 
355L, 360L, 368L, 371L, 383L, 392L, 395L, 404L, 409L, 422L, 430L, 
435L, 447L, 456L, 461L, 468L, 472L, 480L, 483L, 491L, 495L, 505L, 
512L, 516L, 527L, 537L, 545L, 550L, 553L, 558L, 562L, 565L, 568L, 
571L, 583L), class = "data.frame")


y <- structure(c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 4, 5, 30, 29, 28, 
17, 8, 8, 6, 16, 26, 59, 46, 77, 89, 78, 64, 51, 34, 31, 27, 
19, 30, 21, 26, 13, 26, 18, 12, 8, 9, 12, 9, 6, 13, 12, 20, 10, 
14, 14, 11, 8, 10, 13, 7, 6, 4, 4, 8, 2, 2, 4, 3, 0, 2, 0, 1, 
1, 1, 1, 1, 1, 0.941176470588235, 0.875, 0.625, 0.666666666666667, 
0.375, 0.423076923076923, 0.423728813559322, 0.391304347826087, 
0.246753246753247, 0.235955056179775, 0.153846153846154, 0.203125, 
0.0980392156862745, 0.0882352941176471, 0, 0.037037037037037, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0588235294117647, 
0.125, 0.375, 0.333333333333333, 0.5625, 0.5, 0.457627118644068, 
0.478260869565217, 0.545454545454545, 0.561797752808989, 0.564102564102564, 
0.59375, 0.647058823529412, 0.411764705882353, 0.483870967741935, 
0.222222222222222, 0.157894736842105, 0.0666666666666667, 0.0476190476190476, 
0.0384615384615385, 0.0769230769230769, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0625, 0.0769230769230769, 
0.11864406779661, 0.130434782608696, 0.207792207792208, 0.191011235955056, 
0.269230769230769, 0.171875, 0.176470588235294, 0.5, 0.483870967741935, 
0.481481481481481, 0.736842105263158, 0.8, 0.619047619047619, 
0.576923076923077, 0.615384615384615, 0.423076923076923, 0.277777777777778, 
0.333333333333333, 0.25, 0.111111111111111, 0.166666666666667, 
0.111111111111111, 0, 0, 0, 0, 0, 0.142857142857143, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0.0112359550561798, 0.0128205128205128, 
0.03125, 0.0784313725490196, 0, 0.032258064516129, 0.185185185185185, 
0.105263157894737, 0.1, 0.285714285714286, 0.269230769230769, 
0.307692307692308, 0.5, 0.5, 0.583333333333333, 0.625, 0.555555555555556, 
0.666666666666667, 0.444444444444444, 0.5, 0.538461538461538, 
0.333333333333333, 0.25, 0.1, 0, 0.214285714285714, 0, 0.125, 
0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.037037037037037, 
0, 0.0333333333333333, 0.0476190476190476, 0.115384615384615, 
0, 0.0769230769230769, 0.166666666666667, 0.0833333333333333, 
0.125, 0.222222222222222, 0.166666666666667, 0.444444444444444, 
0.333333333333333, 0.230769230769231, 0.333333333333333, 0.4, 
0.9, 0.571428571428571, 0.357142857142857, 0.545454545454545, 
0.5, 0.4, 0.230769230769231, 0.142857142857143, 0.333333333333333, 
0.25, 0, 0.375, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.037037037037037, 0, 
0, 0, 0, 0, 0, 0.0555555555555556, 0, 0, 0.111111111111111, 0, 
0, 0.166666666666667, 0.230769230769231, 0.25, 0.25, 0, 0.214285714285714, 
0.214285714285714, 0.272727272727273, 0, 0.4, 0.461538461538462, 
0.285714285714286, 0, 0.75, 0.5, 0.25, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0833333333333333, 
0.1, 0, 0.0714285714285714, 0.142857142857143, 0.0909090909090909, 
0.125, 0.1, 0.307692307692308, 0.285714285714286, 0.333333333333333, 
0, 0.25, 0.25, 0, 1, 0.5, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0714285714285714, 0.0909090909090909, 
0, 0, 0, 0.285714285714286, 0.333333333333333, 0, 0, 0, 0, 0, 
0.25, 0.333333333333333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 0, 0.25, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.5, 0, 0.25, 
0.666666666666667, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 
0, 0, 0, 0.5, 0, 1), .Dim = c(57L, 13L), .Dimnames = list(NULL, 
    c("len", "nl", "A0", "A1", "A2", "A3", "A4", "A5", "A6", 
    "A7", "A8", "A9", "A10")))

z <- structure(list(Length = c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60)), .Names = "Length", row.names = c(NA, 
-57L), class = "data.frame")

Original function - subset x for one year to run.

subx <- subset(x,Year==2007)

myfunction <- function (x,y,z) {
    #x - expanded length data by year and season
    #y - alk as a percent by season
    #z - length bins from alk percent data 

    #merge z and y to make sure both data sets have the same number of length bins
    comb<-merge(z,x,by="Length",all=T)

    #replace NA by column
    comb$Year[is.na(comb$Year)]<-unique(x$Year)
    comb$Exp_number[is.na(comb$Exp_number)] <- 0

    #catch at age   
    catchatage<-comb[,4]*y[,3:13]
    columntotal<-colSums(catchatage[1:dim(catchatage)[1],1:dim(catchatage)[2]])

    #assign and combine age, number of fish at age and year 
    Number=as.data.frame(columntotal)
    Age=rownames(Number);Age<-as.numeric(gsub("A","",Age))  #get rid of the letter A in age
    Year=rep(unique(x$Year),length(columntotal))

    age<-as.data.frame(cbind(Year,Age,Number=Number[,1]))

    #reorder age
    age<-age[order(age[,2]),]
    return(age)
}
myfunction(subx,y,z)

My function with a loop for unique year values - use entire x dataset.

myfunction_2 <- function (x,y,z) {
    #x - expanded length data by year and season
    #y - alk as a percent by season
    #z - length bins from alk percent data

    #loop through years in survey
    #get unique year values from combined year season dataset
    y_levels<-unique(x$Year)

    for (i in length(y_levels)){

    #subset the data 
    subset_data<-x$Year==y_levels[i]

    #merge z and y to make sure both data sets have the same number of length bins
    comb<-merge(z,subset_data,by="Length",all=T)

    #replace NA by column
    comb$Year[is.na(comb$Year)]<-unique(x$Year[i])
    comb$Exp_number[is.na(comb$Exp_number)] <- 0

    #catch at age   
    catchatage<-comb[,4]*y[,3:13]
    columntotal<-colSums(catchatage[1:dim(catchatage)[1],1:dim(catchatage)[2]])

    #assign and combine age, number of fish at age and year 
    Number=as.data.frame(columntotal)
    Age=rownames(Number);Age<-as.numeric(gsub("A","",Age))  #get rid of the letter A in age
    Year=rep(unique(subset_data$Year[i]),length(columntotal))

    age<-as.data.frame(cbind(Year,Age,Number=Number[,1]))

    #reorder age
    age<-age[order(age[,2]),]

    return(age)

    }
}

myfunction_2(x,y,z)

The error message I receive is:

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

So I think my loop must not be subsetting the data by unique year.

Thank you.

Is it an option for you to use `merge` outside the function and pass the `comb` object instead? You could then subset `comb` inside of the function. — Karsten W., Jul 16 '14 at 15:27
No - I did try that already, but when I do the merge outside of the function not all length bins are added for all years. Since there is one length bin in at least one year, the merge function merges that. When I subset data by year the merge function works because not all length bins are in a given year, using all=T adds in the missing bins. — user41509, Jul 16 '14 at 15:32
`catchatage[1:dim(catchatage)[1],1:dim(catchatage)[2]]` is the same as `catchatage` - why not remove the subscripts and just have `colSums(catchatage)` — konvas, Jul 16 '14 at 15:44
Yes - they are the same thing. I carried over the dim... code from something I had been working on earlier. Thanks — user41509, Jul 16 '14 at 17:21

Vincent Guillemot · Answer 1 · 2014-07-16T16:55:24.190

3

There are several thing that should be changed in order for this code to work:

subset_data<-x$Year==y_levels[i] does not actually define a subset, it should rather be subset_data <- subset(x, Year==y_levels[i])
in the loop, for (i in length(y_levels)) should be for (i in 1:length(y_levels)), otherwise the loop will only apply to year 2012
return(age) is inside the loop when it should be outside
the results of each iteration are not combined

After correcting point 1, it should get much easier to correct the rest.

Another suggestion: your second function should use the first one, the code will be easier to read.

Finally, another way to loop without a "for" statement would be by used an lapply (or rather do.call, see @konvas' comment):

 do.call(rbind, lapply(unique(x$Year), function(yy) myfunction(subset(x,Year==yy),y,z) ))

edited Jul 16 '14 at 16:55

answered Jul 16 '14 at 15:36

Vincent Guillemot

3,394
14
21

2

Using `Reduce` with `rbind` is not a good idea (very slow), but it can be replaced by `do.call(rbind, ...)` for better performance. – konvas Jul 16 '14 at 15:43
That is one very useful tip, any idea why? (I edited my code according to this comment) – Vincent Guillemot Jul 16 '14 at 16:29
4

It's because `Reduce` will apply `rbind` many times but `do.call` will only call it once. For example if you have three data frames in a list `l = list(d1, d2, d3)` then `Reduce` is like calling `rbind(rbind(d1, d2), d3)` but `do.call` is like `rbind(d1, d2, d3)`. – konvas Jul 16 '14 at 16:44

score 1 · Answer 2 · answered Jul 16 '14 at 16:42

Although your code should work (after fixing the errors suggested by @VincentGuillemot), you can make myfunction a bit more readable, if you want. I think it would be worth to modify your starting data frame, right from the beginning, to include every combination of Length and Year (rather than doing it in myfunction, for one year at a time).

So, since what you are after is all length-year combinations, how about something like

# create a data frame consisting of all length-year combinations
data <- expand.grid(Length = z$Length, Year = unique(x$Year)) 
data <- merge(data, x, all = TRUE) # merge with x
data <- merge(data, as.data.frame(y), by.x = "Length", by.y = "len") # merge with y
data$Exp_number[is.na(data$Exp_number)] <- 0 # set missing Exp_number values to 0

At this stage your data is in one data frame (instead of 3) and the missing values have been taken care of (except for the column Season, which you don't seem to care about). I find it easier to perform analysis on d (rather than x, y, z) and you can focus on the actual computations rather than on merging data frames and replacing NAs.

Now, your function will look something like

myfunction <- function(d) {
    # d is a subset of data for a given year
    catchatage <- d$Exp_number * d[grep("^A[0-9]*$", names(d))]
    Number <- colSums(catchatage)
    Age <- as.numeric(gsub("A", "", names(Number)))
    result <- data.frame(Year = d$Year[1],
        Age = Age, Number = Number)
    result[order(result$Age), ]
}

I prefer using column names and grep instead of column indices, because using indices can lead to errors that go undetected, if the order/structure of the data changes (on the other hand, the names of the variables rarely change).

It remains to apply myfunction to subsets of data by Year, and combine the results. This can be done in many ways, using lapply as @VincentGuillemot suggested in his post, or by (or several other non-base methonds like plyr, dplyr, data.table if you are interested to look into them)

do.call(rbind, by(data, list(data$Year), myfunction))

Thanks to Vincent Guillemot and konvas for you suggestions. I tried both of them and am still having trouble with my original function and getting the iterations saved so maybe I'll post that later. Konvas your code worked great and is much simpler than mine. Can you explain what d[grep("^A[0-9]*$", names(d))] does please. I'm not familiar with this. I wondering because my other datasets will have a different number of ages and I think I will need to change the range in between the brackets to handle this. Thanks — user41509, Jul 16 '14 at 17:50
`grep` searches for a pattern in a string (or a vector of strings) and returns `TRUE` if there is a match and `FALSE` if there isn't one. The pattern is called a regular expression, and in this case it is "^A[0-9]*$". The ^ signals the beginning of the string and $ the end. [0-9] stands for numbers between 0 and 9 and * means match the previous element 0 or more times. So this regular expression will match all strings that start with A followed by numbers. — konvas, Jul 16 '14 at 21:54
(continued from above) If it confuses you can just do `d[paste0("A", 1:10)]` in this case, but using `grep` allows for other data frames with more or fewer age columns (provided that the pattern stays the same). You can read `?grep` for more info and also search for "regular expression" on the internet. — konvas, Jul 16 '14 at 21:55

Loop over unique values R

2 Answers2

Linked