
This question is mostly about code clarity, easier maintenance, and good practice. I am fairly new to tidyr, dplyr and similar packages, but I want to get better at them since they seem to be keystones of data analysis in R. A few weeks ago I came across this data and, even though I tried to analyse it with those packages, I had to resort to a nested for loop. I have always read that if I have to resort to a loop in R, I am probably doing something wrong, so what better way to learn than through an example. Can you help me out with this?

I have fishing data covering several years (Year), species (Specie), "types of catch" (Gear), two different categories (Obs), the amount caught (Value), and the corresponding semester (Semester). There are some repeated occurrences: for the same year, species, gear, obs and semester, there are sometimes two or more values.

Basically, I want to transform this data by summing the amount caught for each combination of year, species, gear and obs. This is an example of my data from dput:

data <- structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005), Specie = c("Bagre", "Bagre", "Cabrinha", "Bagre", "Cabrinha", 
"Cabrinha", "Cabrinha", "Cabrinha", "Cabrinha", "Cabrinha", "Cabrinha", 
"Bagre", "Bagre", "Bagre", "Bagre", "Bagre", "Bagre", "Cabrinha", 
"Cabrinha", "Cabrinha", "Cabrinha", "Bagre", "Bagre", "Bagre", 
"Cabrinha", "Cabrinha", "Cabrinha", "Cabrinha", "Bagre", "Bagre"
), Gear = c("Net", "Net", "Boat", "Net", "Net", "Boat", "Boat", 
"Boat", "Net", "Boat", "Boat", "Boat", "Net", "Boat", "Boat", 
"Net", "Net", "Boat", "Net", "Boat", "Boat", "Boat", "Boat", 
"Net", "Boat", "Net", "Boat", "Boat", "Boat", "Boat"), Value = c(43.552, 
1.469, 32.952, 19.35, 0.18, 0.14, 0.1, 150.204, 147.439, 31.28, 
8.86, 7.92, 2.26, 0.48, 0.18, 13.079, 2.529, 201.054, 74.563, 
47.8, 5.04, 1.84, 0.2, 0.14, 322.034, 117.35, 19.74, 6.72, 4.46, 
0.28), Obs = c("Art", "Art", "Ind", "Ind", "Ind", "Ind", "Ind", 
"Ind", "Ind", "Ind", "Ind", "Ind", "Ind", "Ind", "Ind", "Art", 
"Art", "Ind", "Ind", "Ind", "Ind", "Ind", "Ind", "Ind", "Ind", 
"Ind", "Ind", "Ind", "Ind", "Ind"), Semester = c(1, 2, 1, 1, 
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2)), row.names = c(NA, 30L), class = "data.frame")

And this is the code I came up with in order to solve my issue:

df <- data.frame(matrix(ncol = 5, nrow = 0))
names(df) <- c("Year", "Specie", "Gear", "Obs", "Value")

#very inefficient way but works
for (year in unique(data$Year)) {
  tmpyear <- data[data$Year == year,]
  for (sp in unique(tmpyear$Specie)) {
    tmpsp <- tmpyear[tmpyear$Specie == sp,]
    for (gear in unique(tmpsp$Gear)) {
      tmpgr <- tmpsp[tmpsp$Gear == gear,]
      for (obs in unique(tmpgr$Obs)) {
        tmpobs <- tmpgr[tmpgr$Obs == obs,]
        total <- sum(tmpobs$Value)
        tmp_df <- cbind(tmpobs[1,c(1,2,3,5)], total)
        names(tmp_df) <- names(df)
        df <- rbind(df, tmp_df)
      }
    }
  }
}

I am basically looping through each variable of interest and then summing up everything there. This is the output from my code (which works like I intend and produces the output that I want):

structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2005, 2005, 
2005, 2005, 2005), Specie = c("Bagre", "Bagre", "Bagre", "Cabrinha", 
"Cabrinha", "Bagre", "Bagre", "Bagre", "Cabrinha", "Cabrinha"
), Gear = c("Net", "Net", "Boat", "Boat", "Net", "Net", "Net", 
"Boat", "Boat", "Net"), Obs = c("Art", "Ind", "Ind", "Ind", "Ind", 
"Art", "Ind", "Ind", "Ind", "Ind"), Value = c(45.021, 21.61, 
8.58, 223.536, 147.619, 15.608, 0.14, 6.78, 602.388, 191.913)), row.names = c(1L, 
4L, 12L, 3L, 5L, 16L, 24L, 22L, 18L, 19L), class = "data.frame")

I understand this may not seem worth investigating since my code already works, but I have some free time right now and, as I want to pursue a career in data science, I would like to learn how to avoid all these nested loops (since they do not seem to be the right way to go).

Thank you for your time

    Using `dplyr`, `data %>% group_by(Year, Specie, Gear, Obs) %>%summarise(Total = sum(Value), .groups = "drop")` and in base R you may use `aggregate` `aggregate(Value~Year + Specie + Gear + Obs, data, sum)` – Ronak Shah Nov 06 '22 at 01:00
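
The two one-liners suggested in the comment above can be sketched end to end as follows. This is only a minimal illustration, not a definitive answer: `sample_data` is a hypothetical name for a six-row subset of the question's data (rows 1, 2, 4, 13, 16 and 17), used so the block runs on its own, and the dplyr part is skipped when that package is not installed.

```r
# Hypothetical subset of the question's data (rows 1, 2, 4, 13, 16, 17),
# kept short so the example is self-contained.
sample_data <- data.frame(
  Year     = c(2000, 2000, 2000, 2000, 2005, 2005),
  Specie   = c("Bagre", "Bagre", "Bagre", "Bagre", "Bagre", "Bagre"),
  Gear     = c("Net", "Net", "Net", "Net", "Net", "Net"),
  Value    = c(43.552, 1.469, 19.35, 2.26, 13.079, 2.529),
  Obs      = c("Art", "Art", "Ind", "Ind", "Art", "Art"),
  Semester = c(1, 2, 1, 2, 1, 2)
)

# Base R: sum Value within each Year/Specie/Gear/Obs combination.
# Semester is not part of the formula, so the semesters are added together.
totals <- aggregate(Value ~ Year + Specie + Gear + Obs,
                    data = sample_data, FUN = sum)

# Sort the rows explicitly so the result is easy to compare with the loop output.
totals <- totals[order(totals$Year, totals$Specie, totals$Gear, totals$Obs), ]
print(totals)

# dplyr equivalent, run only if the package is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  grouped <- dplyr::group_by(sample_data, Year, Specie, Gear, Obs)
  totals_dplyr <- dplyr::summarise(grouped, Value = sum(Value), .groups = "drop")
  print(totals_dplyr)
}
```

On this subset the three group totals should match the corresponding rows of the loop output in the question (45.021, 21.61 and 15.608). Grouping replaces the four nested loops: each unique combination of the grouping columns is visited once, and the data frame is not grown row by row with `rbind`.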
