1

I have a data frame with 3 years worth of sales data that I'm trying to convert to a time series. Manually creating subsets for each of the 36 months:

mydfJan2011 <- subset(myDataFrame, 
                      as.Date("2011-01-01") <= myDataFrame$Dates & 
                      myDataFrame$Dates <= as.Date("2011-01-31"))

...

mydfDec2013 <- subset(myDataFrame, 
                      as.Date("2013-12-01") <= myDataFrame$Dates & 
                      myDataFrame$Dates <= as.Date("2013-12-31"))

and then summing them up and putting them into a vector

counts[1] <- sum(mydfJan2011$itemsSold)

...

counts[36] <- sum(mydfDec2013$itemsSold)) 

to get the values for the time series works fine, but I'd like to make it a little more automatic as I have to create more than one time series, so I'm trying to turn it into a loop.

In order to do that, I need to create a string with a subset command like this:

"subset(myDataFrame, 
        as.Date("2011-01-01") <= myDataFrame$Dates & 
        myDataFrame$Dates <= as.Date("2011-01-31"))"

But when I use paste, the result is this:

myString
>"subset(myDataFrame, as.Date(\"2011-02-01\") <= myDataFrame$Dates & myDataFrame$Dates <= as.Date(\"2011-02-28\"))"

and

 eval(parse(text = myString))

results in the following error message:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

whereas just typing in the command (without escapes) results in the subset I'm trying to create.

I've tried playing around with single and double quotes, substitute and deparse, but none of it results in any kind of subset of my data frame.

Any suggestions?

Even another way of splitting up the data by month and summing it up would be welcome.

Thanks, Signe

SiKiHe
  • 439
  • 6
  • 16
  • 3
    There is absolutely no reason to be using `eval(parse())` here. You'd be better off using `split()` and `cut.Date()`. I strongly suggested you edit your question to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and describe what you really want to do (split and summarize a data.frame) rather than how you think you should do it. This is an example of the classic [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – MrFlick Jan 23 '15 at 16:03
  • 1
    Additionally, `as.Date("2011-02-29")` will give the error you describe (with or without the eval/parse) because there was no 29th of February in 2011. – MrFlick Jan 23 '15 at 16:07
  • The 29th of February was in 2012 - I know :) – SiKiHe Jan 23 '15 at 16:12
  • Well, with that date problem "fixed", you shouldn't be getting the same error message. – MrFlick Jan 23 '15 at 16:14
  • I wrote how I thought I should do it because it worked the donkey way and then I was trying to use what worked in a function with parametres so that I wouldn't have to keep doing it the donkey way every time I need to generate a time series. Maybe that's bad programming practise, but it's the practise I've learned. I know there are people more familiar with R than me, which is why I also asked if there was another way to do it. – SiKiHe Jan 23 '15 at 16:30

2 Answers2

2

Here is a solution using tapply:

with(sales, tapply(itemsSold, substr(Dates, 1, 7), sum))

Produces monthly sums (I limited my data to 9 months for illustrative purposes, but this extends to longer periods):

 2011-01  2011-02  2011-03  2011-04  2011-05  2011-06  2011-07  2011-08  2011-09 
1592.097 1468.427 1594.386 1563.014 1595.489 1560.361 1553.128 1663.705 1325.519      

tapply computes the sum of values in a vector (sales$sales) grouped by the values of another vector (substr(sales$date, 1, 7), which is basically "yyyy-mm"). with allows me to avoid me typing sales$ repeatedly. You should almost never have to use eval(parse(...)). There is almost always a better, faster way to do it without resorting to that.

And here is the data I used:

set.seed(1)                    
sales <- data.frame(Dates=seq(as.Date("2011-01-01"), as.Date("2011-09-30"), by="+1 day"))
sales$itemsSold <- runif(nrow(sales), 1, 100)

For reference, there are also several 3rd party packages that simplify this type of computation (see data.table, dplyr).

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • Thanks, I think this is faster, too. I wanted to avoid all the looping because I know R doesn't handle loops as well as vectors etc., but I didn't know how to do it without the loops :) – SiKiHe Jan 23 '15 at 16:46
  • 2
    @user3060988, loops are not particularly slow if you pre-allocate the vector you are populating (e.g. `res <- numeric(1000L); for(i in 1:1000)...`). – BrodieG Jan 23 '15 at 16:57
  • 1
    @user3060988 You would be doing the R world a favor if you could track down the person(s) who gave you that misinformation and set them straight. All the `*apply` functions are just loops under the hood. It's forgetting pre-allocation and inefficient code inside the loops that are the source of inefficiencies. If there is not a vectorized solution the provides the proper answer then loops (of one sort or another) are needed. – IRTFM Jan 23 '15 at 18:51
  • In my own experience, I brought the processing time of a loop handling very large amounts of data (with a pre-allocated vector) down from 5 mins + to a matter of seconds by switching from a double or triple loop to the built-in aggregate function - I know this was down to use of a very inefficient algorithm, but I have encountered the same ideas (that R doesn't handle loop very well) in comments from other users in various forums like this. – SiKiHe Jan 26 '15 at 09:09
1

Here's a data.table approach that aggregates by year and month, using the first of the month as the respective group label:

library(data.table)
##
mDt <- Dt[
  ,list(monthSold=sum(itemsSold)),
  keyby=list(mDay=as.Date(paste0(
    year(Dates),"-",month(Dates),"-01")))]
##
R> head(mDt)
         mDay monthSold
1: 2012-01-01       179
2: 2012-02-01       128
3: 2012-03-01       152
4: 2012-04-01       160
5: 2012-05-01       152
6: 2012-06-01       141

Data:

set.seed(123)
Dt <- data.table(
  Dates=seq.Date(
    from=as.Date("2012-01-01"),
    to=as.Date("2014-12-31"),
    by="day"),
  itemsSold=rpois(1096,5))
nrussell
  • 18,382
  • 4
  • 47
  • 60