Please let me know if this is a duplicate. I've looked over Stack and have found questions that are similar, but don't quite answer my question. I'm a beginner, so I appreciate anyone's help.

I am trying to add yearly summary variables to a data frame of monthly data (i.e., summing across rows). Here is an example data frame with limited monthly data.

df <- data.frame("Jan.2012" = c(1, 4, 5, 6), "Feb.2012" = c(3, 5, 7, 9),
 "Jan.2013" = c(6, 8, 9, 10), "Feb.2013" = c(7, 5, 11, 13), "Jan.2014" = c(6, 8, 9, 11), 
 "Feb.2014" = c(7, 3, 5, 9))

The new variables would be named TotalYr2012, TotalYr2013, etc. For example, TotalYr2012 = c(4, 9, 12, 15), etc.

I am trying to generate these variables with a for loop (not best practice, I know). I know I am doing something wrong with the assign statement, and I get an error.

for (i in 2012:2014) {
  varname <- paste("TotalYr", i, sep = "")
    assign(df$varname, df %>% select(contains("i")) %>% 
     mutate(varname = sum()))
}

Thanks for your help!

MKR
LauraDR

2 Answers

You can avoid the for loop by using tidyr::gather and zoo::as.yearmon to calculate a summary for each year.

library(tidyverse)
library(zoo)
library(lubridate)  # provides year(); older tidyverse versions do not attach it

df %>% gather(Date, value) %>%
  mutate(Date = as.yearmon(Date,"%b.%Y")) %>%
  group_by(Year = year(Date)) %>%
  summarise(Total = sum(value)) %>%
  spread(Year, Total)

# # A tibble: 1 x 3
#     `2012` `2013` `2014`
# *   <dbl>  <dbl>  <dbl>
# 1   40.0   69.0   58.0  
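
If the totals are wanted per row rather than over whole columns (as the question's TotalYr2012 = c(4, 9, 12, 15) example suggests), the same reshape-and-summarise idea can be adapted. This is only a sketch under a few assumptions: it uses pivot_longer/pivot_wider (tidyr 1.0 or later) instead of gather/spread, it introduces a helper column named `row` that is not in the question, and it relies on every column name ending in ".<year>".

```r
library(dplyr)
library(tidyr)

df <- data.frame("Jan.2012" = c(1, 4, 5, 6), "Feb.2012" = c(3, 5, 7, 9),
                 "Jan.2013" = c(6, 8, 9, 10), "Feb.2013" = c(7, 5, 11, 13),
                 "Jan.2014" = c(6, 8, 9, 11), "Feb.2014" = c(7, 3, 5, 9))

row_totals <- df %>%
  mutate(row = row_number()) %>%                       # hypothetical row id
  pivot_longer(-row, names_to = "Date", values_to = "value") %>%
  mutate(Year = sub("^.*\\.", "", Date)) %>%           # "Jan.2012" -> "2012"
  group_by(row, Year) %>%
  summarise(Total = sum(value), .groups = "drop") %>%  # one total per row and year
  pivot_wider(names_from = Year, values_from = Total,
              names_prefix = "TotalYr")

row_totals$TotalYr2012
# 4 9 12 15
```

The result has one row per original row, so it can be joined or bound back to df if the monthly columns are still needed.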
MKR
  • I appreciate this. I must not have explained myself clearly. I want to sum across rows of df, as each record represents something such as Product X sales in Jan. 2012, Product Y sales in Feb. 2013, etc. So I would want to see all of Product X sales for the months of 2012, etc. However, seeing this code did help me understand a dplyr approach to get to the answer I wanted, so thanks! – LauraDR May 14 '18 at 20:24
You've got a lot of problems.

  1. $ doesn't work with a column name stored in a variable. Use [ or [[ instead. See this R-FAQ for additional info. Also, your indentation is messy, so let's make it consistent:

    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      assign(df[[varname]], df %>% select(contains("i")) %>% 
        mutate(varname = sum()))
    }
    
  2. assign isn't needed, just use <- (or =). You should almost never use assign().

    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      df[[varname]] <- df %>% select(contains("i")) %>% 
        mutate(varname = sum())
    }
    
  3. "i" is a string; its value is always the letter "i", just like the value of 2 is always 2. Even in contains() you want to use the value you have assigned to the object i (though that value does need to be a string, that is, it needs to be of class character):

    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      df[[varname]] <- df %>% select(contains(as.character(i))) %>% 
        mutate(varname = sum())
    }
    
  4. mutate and select return data frames, which means your code is trying to assign a data frame to df[[varname]]. We want to assign just a column vector, not a whole data frame. So we pull out the column vector with dplyr::pull:

    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      df[[varname]] <- df %>% select(contains(as.character(i))) %>% 
        mutate(varname = sum()) %>%
        pull()
    }
    
  5. Type sum() in your console - you get 0. You need to give something to sum(). Note too that pull() with no arguments returns only the last column, so summing the pulled vector would miss every month but the last one. Let's get rid of mutate and pull entirely and sum the selected columns directly (sum() accepts a numeric data frame):

    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      df[[varname]] <- df %>% select(contains(as.character(i))) %>%
        sum()
    }
    
  6. Okay, it sort of works now. But you're adding these new values to the old data frame, which has a bunch of rows. The new values are just single values, so they're "recycled", repeated on every row of the data frame. Let's collect them in a new result list instead and convert it to a data frame with just one row:

    result = list()
    for (i in 2012:2014) {
      varname <- paste("TotalYr", i, sep = "")
      result[[varname]] <- df %>% select(contains(as.character(i))) %>%
        sum()
    }
    result = as.data.frame(result)
    result
    #   TotalYr2012 TotalYr2013 TotalYr2014
    # 1          40          69          58
    
  7. Now it runs and returns a one-row data frame. However, it is still messy. for loops usually aren't necessary. We have much better tools for tidying data.

    # See MKR's answer. It's the way you should actually do this.
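
Points 1, 3 and 5 are easy to check in the console. The sketch below uses a small hypothetical data frame called `toy` (not the one from the question), and it uses grepl() on the column names as a stand-in for contains(), which only works inside select():

```r
# Hypothetical two-row data frame, for illustration only
toy <- data.frame(Jan.2012 = c(1, 4), Feb.2012 = c(3, 5))
varname <- "Jan.2012"

# Point 1: `$` looks for a column literally named "varname"
toy$varname          # NULL
toy[[varname]]       # 1 4, `[[` uses the value stored in varname

# Point 3: "i" is the letter i, not the value of the object i
i <- 2012
grepl("i", names(toy))               # FALSE FALSE
grepl(as.character(i), names(toy))   # TRUE TRUE

# Point 5: sum() with no input is 0
sum()                # 0
sum(toy[[varname]])  # 5
```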
    
Gregor Thomas
  • Fantastic explanation, which covers almost everything. Your answer is probably much more helpful to beginners than mine. – MKR May 12 '18 at 04:45
  • My answer covers what is wrong with OP's approach, your answer covers the right approach. I think they complement each other well. – Gregor Thomas May 12 '18 at 15:52
  • Thanks to you both. I learned a lot of general principles for future projects from just reading the principles you shared. – LauraDR May 14 '18 at 19:17
  • However, when I run this code in RStudio, result returns a data frame with 0 rows and 0 columns. Please let me know if I'm missing something. – LauraDR May 14 '18 at 19:19
  • I forgot to change one last `df` to `result` in #6. Fixed now. – Gregor Thomas May 14 '18 at 19:36