Apologies in advance if this has been addressed before, but I've tried looking through all the questions related to ddply, sapply, and apply, and can't for the life of me figure this one out...
I've written a function, countMonths, that takes the day, the month, and the total number of days in a billing cycle as arguments and returns the number of calendar months that the billing cycle touches:
countMonths <- function(day, month, cycle.days) {
  # days per calendar month (non-leap year)
  month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  if (month < 1 | month > 12 | floor(month) != month) {
    cat("Invalid month value, must be an integer from 1 to 12\n")
  } else if (day < 1 | day > month.days[month]) {
    # print the actual upper bound rather than the literal text "month.days[month]"
    cat("Invalid day value, must be between 1 and", month.days[month], "\n")
  } else if (cycle.days < 0) {
    cat("Invalid cycle.days value, must be >= 0\n")
  } else {
    nmonths <- 1
    day.ct <- cycle.days - day  # cycle days that fall before the current month
    while (day.ct > 0) {
      nmonths <- nmonths + 1
      month <- ifelse(month == 1, 12, month - 1)  # sets to previous month
      day.ct <- day.ct - month.days[month]        # subtracts days of previous month
    }
    nmonths
  }
}
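To make the intended scalar behaviour concrete, here is a minimal sketch of the counting loop on its own. The name count_months_core is hypothetical, the input validation is left out, and a plain if/else stands in for ifelse; it is only meant to show what a single call with length-1 arguments should return.

```r
# Condensed, hypothetical copy of the counting loop (validation omitted)
count_months_core <- function(day, month, cycle.days) {
  month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  nmonths <- 1
  day.ct <- cycle.days - day
  while (day.ct > 0) {
    nmonths <- nmonths + 1
    month <- if (month == 1) 12 else month - 1  # step back one month
    day.ct <- day.ct - month.days[month]
  }
  nmonths
}

count_months_core(2, 9, 29)   # 2: 27 days spill back into August
count_months_core(2, 3, 35)   # 3: spills through February into January
count_months_core(15, 6, 10)  # 1: the whole cycle fits inside June
```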
I'd like to apply this function to every row in a data.frame containing billing records by customer, e.g.
> head(cons2[-1],10)
   kwh cycle.days  read.date row.index year month day kwh.per.day
1  381         29 2010-09-02         1 2010     9   2   13.137931
2  280         32 2010-10-04         2 2010    10   4    8.750000
3  282         29 2010-11-02         3 2010    11   2    9.724138
4  330         34 2010-12-06         4 2010    12   6    9.705882
5  371         30 2011-01-05         5 2011     1   5   12.366667
6  405         30 2011-02-04         6 2011     2   4   13.500000
7  441         32 2011-03-08         7 2011     3   8   13.781250
8  290         29 2011-04-06         8 2011     4   6   10.000000
9  296         29 2011-05-05         9 2011     5   5   10.206897
10 378         32 2011-06-06        10 2011     6   6   11.812500
> dput(head(cons2[-1],10))
structure(list(kwh = c(381L, 280L, 282L, 330L, 371L, 405L, 441L,
290L, 296L, 378L), cycle.days = c(29L, 32L, 29L, 34L, 30L, 30L,
32L, 29L, 29L, 32L), read.date = structure(c(1283385600, 1286150400,
1288656000, 1291593600, 1294185600, 1296777600, 1299542400, 1302048000,
1304553600, 1307318400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
row.index = 1:10, year = c(2010, 2010, 2010, 2010, 2011,
2011, 2011, 2011, 2011, 2011), month = c(9, 10, 11, 12, 1,
2, 3, 4, 5, 6), day = c(2L, 4L, 2L, 6L, 5L, 4L, 8L, 6L, 5L,
6L), kwh.per.day = c(13.1379310344828, 8.75, 9.72413793103448,
9.70588235294118, 12.3666666666667, 13.5, 13.78125, 10, 10.2068965517241,
11.8125)), .Names = c("kwh", "cycle.days", "read.date", "row.index",
"year", "month", "day", "kwh.per.day"), row.names = c(NA, 10L
), class = "data.frame")
I tried a couple of options, and neither works well. Specifically, I need to pass the value of each variable as a scalar (a length-1 vector) for each row of the data frame, but the whole column always gets passed as a vector:
> cons2$tot.months <- countMonths(cons2$day, cons2$month, cons2$cycle.days)
Warning messages:
1: In if (month < 1 | month > 12 | floor(month) != month) { :
the condition has length > 1 and only the first element will be used
2: In if (day < 1 | day > month.days[month]) { :
the condition has length > 1 and only the first element will be used
3: In if (cycle.days < 0) { :
the condition has length > 1 and only the first element will be used
4: In while (day.ct > 0) { :
the condition has length > 1 and only the first element will be used
5: In while (day.ct > 0) { :
the condition has length > 1 and only the first element will be used
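As far as I can tell, the warnings come from the fact that each comparison on a whole column produces one logical value per row, while if and while only look at a single value. A minimal sketch with a made-up three-element vector (not the real data):

```r
# Hypothetical stand-in for cons2$day
day <- c(2L, 4L, 2L)

cond <- day < 1
length(cond)  # 3: one logical per row, not a single TRUE/FALSE
cond[1]       # FALSE: the only element that `if` used when it warned
```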
I was finally able to get the right result using ddply, treating each row as its own group, but it takes a LONG time:
library(plyr)  # provides ddply() and .()
cons2 <- ddply(cons2, .(account, year, month, day), transform,
  tot.months = countMonths(day, month, cycle.days)
)
Is there a better way to apply this function to each row of my data frame? Or, as a related question, how can I pass variables from a data frame as scalar arguments (the value from a given row) instead of the vector of all values of that variable in the data frame? I'd especially appreciate it if someone could point out where I'm going wrong conceptually in my thinking.
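For reference, one row-wise approach in base R that might fit here is mapply, which calls a function once per set of parallel elements, so each call receives length-1 arguments. The following is only a sketch under assumptions: count_months_core is a hypothetical condensed copy of the counting loop without validation, and cons_demo is a small stand-in built from the first five rows shown above.

```r
# Hypothetical condensed copy of the counting loop (validation omitted)
count_months_core <- function(day, month, cycle.days) {
  month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  nmonths <- 1
  day.ct <- cycle.days - day
  while (day.ct > 0) {
    nmonths <- nmonths + 1
    month <- if (month == 1) 12 else month - 1  # step back one month
    day.ct <- day.ct - month.days[month]
  }
  nmonths
}

# Stand-in data: day, month and cycle.days from the first five rows above
cons_demo <- data.frame(
  day        = c(2L, 4L, 2L, 6L, 5L),
  month      = c(9, 10, 11, 12, 1),
  cycle.days = c(29L, 32L, 29L, 34L, 30L)
)

# mapply walks the three columns in parallel, one row per call
cons_demo$tot.months <- mapply(count_months_core,
                               cons_demo$day,
                               cons_demo$month,
                               cons_demo$cycle.days)
cons_demo$tot.months  # 2 2 2 2 2
```

Whether this is faster than the ddply version on the full data is not something the sketch demonstrates; it only shows the per-row calling pattern.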