0

I am trying to find a clean, efficient to create a new variable with complex calculations on 5 existing variables. My problem is, the fact that one variable is a factor, and the other 4 contain NAs.

I have a dataset with several groups of variables with the following structure:

  • expenditure_period - factor for which 1 = daily, 2 = weekly, 3 = monthly, 4 = yearly
  • expenditure1 - integer, amount spent for a daily period
  • expenditure2 - integer, amount spent for a weekly period
  • expenditure3 - integer, amount spent for a monthly period
  • expenditure4 - integer, amount spent for a yearly period

For each row/observation, only one of the 4 integer fields has a numerical value depending on the value of expenditure_period, the rest are NAs.

For example:

   expenditure_period  expenditure1  expenditure2  expenditure3  expenditure4
1             monthly            NA            NA             5            NA
2              weekly            NA             5            NA            NA
3             monthly            NA            NA             2            NA
4             monthly            NA            NA             5            NA
5             monthly            NA            NA            58            NA

I want to create a new variable that contains the standard monthly expenditure. So if the expenditure_period was daily, then expenditure1*30. If weekly, then expenditure2 * 4. If monthly, then expenditure3*1. And if yearly, then expenditure4 / 12.

The best solution I could come up with is the following mess:

data$expenditure_factor[data$expenditure_period=="daily"] <- 30
data$expenditure_factor[data$expenditure_period=="weekly"] <- 4
data$expenditure_factor[data$expenditure_period=="monthly"] <- 1
data$expenditure_factor[data$expenditure_period=="yearly"] <- 1/12
data$expenditure_month <- apply(data[,c("expenditure1", "expenditure2",
 "expenditure3", "expenditure4", "expenditure_factor")], 1, 
function(x) { sum(x[1:4], na.rm=TRUE) * x[5]} )

I tried adding together expenditure1,2,3,4 using the + operator, but this resulted in all NAs due to adding 1 number to 3 NAs. I tried creating an interim variable using the sum function with rm.na, but this resulted in the same sum for every row. I tried using mutate from the dplyr package, to no effect.

Is there a simpler, more elegant way to do this? I have to do the same process on about 12 different expenditure categories. I apologize if this has been asked before, I could not find a similar thread. Please direct me if there is one already.

I'm using RStudio with R 3.2.3 on Windows 7.

Derek P
  • 3
  • 1
  • It would be better if your example were easily reproducible and you showed your desired/expected result as well. Here's some guidance: http://stackoverflow.com/a/28481250/1191259 – Frank Feb 16 '16 at 19:07
  • 1
    use a row-wise `apply` statement with `switch` – alexwhitworth Feb 16 '16 at 19:12

2 Answers2

0

"Clean, efficient" is an opinion, but the following would be pretty easy to maintain and understand if you haven't looked at the code for a while. It keeps data in separate tables, does one thing at a time, and can be inspected between steps.

# conversion table to replace bulk of mess with slightly better mess of code that is easy to inspect
expenditure_factor <- data.frame(expenditure_period = c('daily','weekly','monthly','yearly'),
                                 pfactor = c(30,4,1,1/12),
                                 stringsAsFactors = F)

# sum total expenditure (expenditurex) and remove extra columns
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = T)
data$expenditure1 <- data$expenditure2 <- data$expenditure3 <- data$expenditure4 <- NULL

# add factor from conversion table
data <- merge(data,expenditure_factor,by = 'expenditure_period',all.x = T)

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$pfactor

Or this could be shoved into a one-liner.

Assuming expenditure_period is a character variable:

data$expenditure_period <- as.character(data$expenditure_period)

Then:

# sum total expenditure
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = T)

# use an index
data$expenditure_factor <- c(30,4,1,1/12)[match(data$expenditure_period,c('daily','weekly','monthly','yearly'))]

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$expenditure_factor
ARobertson
  • 2,857
  • 18
  • 24
  • I like both, but something along the lines of the last example is what I was after. Thank you! That one seems less cumbersome and more readable to repeat several times. I like that the first one has a reference table of the factors, but since I have to add it to the based on the value of each different expenditure's period anyway, it doesn't really make the code shorter per se. – Derek P Feb 16 '16 at 21:36
0

Okay, this might be a somewhat unorthodox approach, but what if you renamed your columns so that they contain the multpilier, reshaped your data and extracted the multipliers to use for computing the new variable:

library(dplyr)
library(tidyr)

# New cols
data<-rename(data, expenditure.30 = expenditure1, 
            expenditure.4 = expenditure2,
            expenditure.1 = expenditure3,
            `expenditure.1/2` = expenditure4)

# Reshape and calculate new col
data %>% gather(exp_new,exp_val,expenditure.30:`expenditure.1/2`) %>% 
        mutate(mont_exp = exp_val * as.numeric(sub('.*\\.', '', exp_new))) %>%
        na.omit()
#   expenditure_period       exp_new exp_val mont_exp
#7              weekly expenditure.4       5       20
#11            monthly expenditure.1       5        5
#13            monthly expenditure.1       2        2
#14            monthly expenditure.1       5        5
#15            monthly expenditure.1      58       58
mtoto
  • 23,919
  • 4
  • 58
  • 71
  • Unorthodox but interesting! I like that it takes advantage of dplyr and tidyr. Many thanks for your help. – Derek P Feb 16 '16 at 21:31