6

I have multiple sets of time series data and would like help figuring out the best way to get them into R and analyze them with R. I'm pretty familiar with data.table but not so familiar with R's ts class supporting time series analysis.

In particular, I want to known how to use ts in this situation or if there are limitations in ts (such as problems aggregating a set of ts objects) that make it in appropriate to use here.

The Data

There are a large number of stores. For each store, I have multiple data points for each day, such as sales volume in dollars, sales volume in number of transactions, and store traffic (number of people entering the store). (Actually what I have is a table with columns store ID, date, and the data for that store and date.)

What I've been doing is using a data.table with one row per store, aggregating the data by store into months and storing the values for each month in a separate named column (e.g. jan14_dollars, feb14_dollars...) but this is unwieldy for a lot of reasons, in particular when I want to look at weeks or quarters.

I was thinking the right way to handle this was to have columns of type ts so each row would be just be store, dollars_ts, transactions_ts, traffic_ts but (a) how do I get the data into that format and (b) can ts be combined the way integers can to give me the results I want? If you can only answer (a) or (b) but not both, please do answer what you can.

I cannot provide a realistic data set, but you can generate a random one to play with like this:

require("data.table")

storeData <- CJ(store = toupper(letters), date = seq(as.Date('2012-01-01'), as.Date('2014-01-01'), by="day"))
storeData$dollars = sample(100:100000, nrow(storeData), replace = TRUE)/100
storeData$transactions <- sample(0:1000, nrow(storeData), replace = TRUE)
storeData$traffic  <- storeData$transactions + sample(0:1000, nrow(storeData), replace = TRUE)

head(storeData)
   store       date  dollars transactions traffic
1:     A 2012-01-01   48.60          409     990
2:     A 2012-01-02  996.89           36     428
3:     A 2012-01-03   69.35          647    1103
4:     A 2012-01-04  334.56          953     973
5:     A 2012-01-05  692.99          958    1753
6:     A 2012-01-06  973.32          724    1086

The Analysis

I want to answer questions like "how many stores had positive dollar sales growth?" and "is there a relationship between change in dollars/transaction and change in traffic?" and to bin the data into time periods and compare the answers across time periods (e.g. Q1 this year versus Q1 last year).

Can these kinds of questions be answered using ts? If so, how do I get this data into an appropriate set of columns or is there some structure other than data.table I should be using?

Please show both how to organize the data and then how to use the data to answer the example questions "how many stores had positive dollar sales growth in January 2014 compared to January 2013?" and "what is the overall trend in dollars per transaction for the past 3 months?"

Old Pro
  • 24,624
  • 7
  • 58
  • 106
  • 2
    What about some sample data if you want someone to actually answer those example questions? – talat Jul 25 '14 at 14:49
  • @beginneR The actual data is proprietary, of course, and I haven't found a public data set that fits the needs, so I added code to generate a random data set. Of course there will be no trends or correlations in this set, but you can go through the motions. – Old Pro Jul 25 '14 at 20:29
  • @OldPro Rolling joins work well for this type of analysis. I've recently written a blog post you might find helpful http://gormanalysis.com/?p=176 – Ben Jul 27 '14 at 00:19
  • @BenGorman I read your post but don't see how it helps with aggregation or conversion to `ts` class time series or compute monthly year over year growth. – Old Pro Jul 27 '14 at 18:42
  • too broad - break down to small, easily digestible questions imo – eddi Jul 27 '14 at 20:58
  • @eddi The primary question is can I use the `ts` class in this situation and if so, how? – Old Pro Jul 29 '14 at 20:01
  • @OldPro the primary answer is then simply "yes, same way you use other data", but that's not what you're looking for, is it? Presumably you have one or more specific problems you've encountered and I suggest starting with just (ideally simple) one. – eddi Jul 29 '14 at 20:27
  • @eddi you didn't answer "how". The first problem is how do I get the example data.frame converted to 4 columns and 1 row per store as explained in the question. – Old Pro Jul 29 '14 at 22:44
  • I don't see any `ts` objects in your post, no example input/output pairs, no attempts to **do** anything, except ask for some very broad results. – eddi Jul 29 '14 at 22:53
  • @eddi let's move this to chat. http://chat.stackoverflow.com/rooms/info/58299/room-for-old-pro-and-eddi – Old Pro Jul 29 '14 at 23:25

3 Answers3

7

You're asking a lot of questions here. I recommend you spend time reading about all the things data.table can do involving joins and aggregating data. Here is an example of how you would get the year over year growth of each store in the first quarter.

#get the first day of the first month for your binning
minDate<-min(storeData$date); month(minDate)<-1; day(minDate)<-1

#get the first day of the last month for your binning
maxDate<-max(storeData$date); month(maxDate)<-12; day(maxDate)<-1

#Build some bins
yearly<-data.table(leftBound=seq.Date(minDate,maxDate,by="year"))
quarterly<-data.table(leftBound=seq.Date(minDate,maxDate,by="3 months"))
monthly<-data.table(leftBound=seq.Date(minDate,maxDate,by="month"))

#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]

setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")

temp<-quarterly[storeData, roll=TRUE] #associate each (store, date) pair with a quarter

#create a "join table" containing each quarter for each store
jt<-CJ(leftBound=quarterly$leftBound, store=unique(storeData$store))
setkey(temp,"leftBound","store")

dt<-temp[jt, allow.cartesian=TRUE]
dt[, `:=`(year=year(leftBound), quarter=quarter(leftBound))]

qSummary<-dt[,list(dollars=sum(dollars, na.rm=TRUE), 
         transactions=sum(transactions, na.rm=TRUE), 
         traffic=sum(traffic, na.rm=TRUE)),
   by=list(year,quarter,store)] #Summarize the data by quarter

#Get year/year growth for Q1
qSummary[,list(dollarGrowth = dollars[which(year==2014 & quarter==1)] / dollars[which(year==2013 & quarter==1)]), by=store]

 #First five rows...
    store dollarGrowth
 1:     A    0.0134860
 2:     B    0.0137215
 3:     C    0.0188249
 4:     D    0.0163887
 5:     E    0.0037576
Ben
  • 20,038
  • 30
  • 112
  • 189
  • This is useful and interesting information, Ben, and along the lines of what I had been doing (though what you've done is more systematic and organized), but what I really want to know is if the `ts` class can be used in my use case to give the kind of results I'm looking for, and if so, how to make that happen. – Old Pro Jul 29 '14 at 20:01
  • @OldPro Fair enough. I'm admittedly not very familiar with `ts` in R. Good luck. – Ben Jul 29 '14 at 20:06
4

You may want to look into the zoo package. This package uses zoo class which is able to handle irregular time series very well. It also has a zooreg class, which is analogous to ts class, but for this answer we'll stick with zoo.

Here is a quick solution:

Step0: Loading the package and data:

# install.packages("zoo")
library(zoo)

storeData <- data.frame()

st.dates <- seq(as.Date('2012-01-01'),as.Date('2014-01-01'),by="day")

n <- length(st.dates)

storeData <- 
  data.frame(
    store = rep(1:26, n),
    dollars = sample(100:100000, n*26, replace = TRUE)/100,
    transactions = sample(0:1000, n*26, replace = TRUE),
    traffic = sample(0:2000, n*26, replace = TRUE)
  )

Note that I am using a data.frame instead of data.table, and the construction of data is slightly different. With your level of expertise, I am sure you can manipulate it easily.

Step1: Converting the data to time series

Now, in principle you can convert this to zoo or zooreg class right now:

zoo(storeData, order.by=rep(st.dates,26))

But note that each day will have multiple entries corresponding to different stores, so this creates problem with multiple entries on the time series.

Since in our case, we won't have more than one entry per day per store, we want to keep the info for each store separate. So we split the data by store:

storeDataList <- split(storeData,factor(storeData$store))

Now we are ready to convert the data into time series:

storeDataZooList <- lapply(storeDataList,
                           function(storeData) zoo(storeData, order.by=st.dates))

This gives a list of time series data for all the stores.

Step3: Analysis

Now that you have time series data for all the stores, aggregating is straight forward:

You can aggregate a specific field for any store by quarter or month:

aggregate(storeDataZooList[[1]]$dollars,as.yearqtr)
aggregate(storeDataZooList[[1]]$dollars,as.yearmon)

You can aggregate all the fields for any store by quarter or month:

aggregate(storeDataZooList[[1]],as.yearqtr)
aggregate(storeDataZooList[[1]],as.yearmon)

You can aggregate all the fields for all the stores by quarter or month:

lapply(storeDataZooList,aggregate, as.yearmon)
lapply(storeDataZooList,aggregate, as.yearqtr)

The documentation for the package is fairly detailed, and I am sure it will help you tremendously.

Hope this helps!!

EDIT: Note that for simplicity I have kept the storeID as numeric. If you have an alpha-numeric storeID, you will have to drop the storeID at the time series conversion step for aggregation to work:

storeDataZooList <- lapply(storeDataList,
                           function(storeData) zoo(storeData[,-1], order.by=st.dates))
Shambho
  • 3,250
  • 1
  • 24
  • 37
  • 1
    @OldPro: I noticed that you have given the bounty credit, but not accepted the answer. Any remaining question that I can help with?? – Shambho Aug 07 '14 at 17:15
1

Try the package TimeProjection to extract useful features from the date and aggregate on these derived features. In other words, stick to data.frame or data.table unless you want to run algorithms that need ts structure.

Old Pro
  • 24,624
  • 7
  • 58
  • 106
Karsten W.
  • 17,826
  • 11
  • 69
  • 103
  • 1
    `TimeProjection` is an interesting package, but it basically just reduces dates to simpler date parts (e.g. day of week). Not really what I'm looking for. – Old Pro Jul 29 '14 at 20:08