Creating new variables in grouped data

Question

I have my data that looks like below: (basically lets say its sales of different brands by customer trip, blank means the brand isnt purchased in that particular trip for a customer,store refers to store location of purchase)

customerid  date    store   brand1  brand2  brand3  brand4
1   01-03-2012  a    $3.00   $-      $-      $2.00 
1   06-03-2012  a    $2.00   $-      $-      $3.00 
1   11-03-2012  b    $2.00   $1.00   $1.00   $1.00 
1   26-03-2012  a    $2.00   $-      $-      $-   
2   16-03-2012  d    $2.00   $1.00   $1.00   $2.00 
2   21-03-2012  a    $-      $-      $1.00   $2.00 
2   26-03-2012  a    $2.00   $1.00   $3.00   $1.00

i would like to create separate dataframes for each brand containing rows only where that brand has sales >0 , so what i thought.. i could put the brand1-brand4 in a list called colnames_df, like below:

 colnames_df<- colnames(myDf)

Once i do this, i can loop through the contents of this loop to generate the brand level datasets... From data above, i need 4 separate datasets with the relevant brand column and other columns of custID, date alone.. like 4 datasets below is what i want:

Dataset for brand1: (expected output)

customerid  date    store   brand1
1   01-03-2012  a    $3.00 
1   06-03-2012  a    $2.00 
1   11-03-2012  b    $2.00 
1   26-03-2012  a    $2.00 
2   16-03-2012  d    $2.00 
2   26-03-2012  a    $2.00

Dataset for brand2: (expected output)

   customerid   store   date    brand2
1   b   11-03-2012   $1.00 
2   d   16-03-2012   $1.00 
2   a   26-03-2012   $1.00

Likewise there will be dataframes for Brand3 and 4 as well...For this part, should i write something like for( i in length(colnames_df) { paste("Brand",i)<-}... Not sure how to write this.. I need to create brand level data frames from original data above...If i use lapply andsuch functions, i was able to figure out how to obtain list/dataframe with all columns in resulting data.. How do i do what i need to do above...

Apart from above, i have one other requirement:

If the brand level datasets are created, i also need to create lag, counter variables like below on each brand level dataset...

Step1: Create counter variable for each customer trip (after dataset is sorted by custID and Date)...

Expected output for brand1(with counter):

code which i use (im having difficulty putting this code in a loop so that each brand level dataset that is created has the new variable automatically created..instead of brand1 below, it should automatically be brand1,2,3,4 etc)

brand1$counter <- with(brand1, ave(customerID, customerID, FUN = seq_along))

customerid  date    store   brand1  counter_custtrip
1   01-03-2012  a    $3.00  1
1   06-03-2012  a    $2.00  2
1   11-03-2012  b    $2.00  3
1   26-03-2012  a    $2.00  4
2   16-03-2012  d    $2.00  1
2   26-03-2012  a    $2.00  2

2.step2: Create a lag variable....like expected output below..

I can use code like this: (my question is i can do these operations separately per dataset, but how do i do it in such a way that all this happens as each brand level dataset gets created...???)

ddply(.data = df, .variables = .(customerID), mutate,
   lagdate = c(NA, head(date, -1))

Expected output is: (for brand1 dataset)

  customerid    date    store   brand1  counter_custtrip    laggedtripdate
1   01-03-2012  a    $3.00  1   -
1   06-03-2012  a    $2.00  2   01-03-2012
1   11-03-2012  b    $2.00  3   06-03-2012
1   26-03-2012  a    $2.00  4   11-03-2012
2   16-03-2012  d    $2.00  1   -
2   26-03-2012  a    $2.00  2   16-03-2012

step3: Create days between trips variable by store

See expected output for brand1 (and likewise for all brands)

customerid  date    store   brand1  counter_custtrip    laggedtripdate  daysbetweentrips
1   01-03-2012  a    $3.00  1   -   -
1   06-03-2012  a    $2.00  2   01-03-2012  5
1   11-03-2012  b    $2.00  3       -
1   26-03-2012  a    $2.00  4   06-03-2012  20
2   16-03-2012  d    $2.00  1   -   -
2   26-03-2012  a    $2.00  2   16-03-2012  -

As we can see, CustomerID 1 has gone to store a on 3/1, and then 5 days later on 3/6 and then 20 days later on 3/26.. This is the logic.. How do i do this for each customer for each store??...

I know there is a lot, and i am almost there, i just need a couple of lines of advice on how i can put this whole structure together, so that i can put this in a loop with new brand level datasets getting created and each of them having all new variables created during the dataframe creation process....

Let me know what i am missing

dnlbrky · Answer 1 · 2013-09-02T13:34:43.543

Try the following answer, which converts to long format and uses data.table:

library(data.table)

# Your data:
data <- structure(list(customerid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(1325566800, 
1338696000, 1351915200, 1332734400, 1331870400, 1332302400, 1332734400
), class = c("POSIXct", "POSIXt"), tzone = ""), store = c("a", 
"a", "b", "a", "d", "a", "a"), brand1 = c(3L, 2L, 2L, 2L, 2L, 
NA, 2L), brand2 = c(NA, NA, 1L, NA, 1L, NA, 1L), brand3 = c(NA, 
NA, 1L, NA, 1L, 1L, 3L), brand4 = c(2L, 3L, 1L, NA, 2L, 2L, 1L
)), .Names = c("customerid", "date", "store", "brand1", "brand2", 
"brand3", "brand4"), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

# Convert from wide format to long, and subset to records with sales > 0:
data.long<-data.table(data[,list(customerid,store,date,laggedtripdate=as.POSIXct(NA))], brand=names(data)[4:7], sales=c(t(as.matrix(data[,4:7,with=F]))),key=c("customerid","date"))[sales>0]

# Add the lagged date, by customerid:
data.long[data.long[,.N,by=list(customerid,date)][,laggedtripdate:=c(as.POSIXct(NA),date),by=customerid],laggedtripdate:=i.laggedtripdate]

# Add daysbetweentrips:
data.long[,daysbetweentrips:=date-laggedtripdate]

# Add counter_custtrip:
data.long[,counter_custtrip:=1:.N,by=list(customerid,brand)]

# Subset of results for brand==1:
data.long[brand=="brand1"]
#   customerid store       date laggedtripdate  brand sales daysbetweentrips counter_custtrip
#1:          1     a 2012-01-03           <NA> brand1     3          NA days                1
#2:          1     a 2012-03-26     2012-01-03 brand1     2    82.95833 days                2
#3:          1     a 2012-06-03     2012-03-26 brand1     2    69.00000 days                3
#4:          1     b 2012-11-03     2012-06-03 brand1     2   153.00000 days                4
#5:          2     d 2012-03-16           <NA> brand1     2          NA days                1
#6:          2     a 2012-03-21     2012-03-16 brand1     2     5.00000 days                2

hi dnlbrky, thanks...the point is i need to further use the brand level datasets as inputs to another set of regressions, so,, i would need to create a loop which creates brand level data frames like i mentioned filtered for rows with sales>0. I dont have variables named as brand1,2,3 etc they are descriptive text columns with names of brands..so i will need to loop this in and thats why i wanted to put the brand colnames in a list and then loop through it..this is what i have difficulty with.. please let me know if i am being unclear — , Sep 02 '13 at 04:01
Hi @kaos1511. I had updated the brand column to use your original names instead of an integer, right before you posted your comment. Did you see that before you posted, or maybe I'm not understanding that part? As for the regression, you can easily do that by model with one table. It should be much more efficient than breaking it up into multiple tables. Maybe you could add the regression example to the original question, and I can modify my answer to show you. — dnlbrky, Sep 02 '13 at 04:06

Henrik · Answer 2 · 2013-09-02T14:55:02.710

Here's an example with the data in a long data frame format instead.

library(reshape2)
library(plyr)


# Prepare data
# melt data
# measured variables given as a vector of variable names
df2 <- melt(data = df,
            measure.vars = paste0("brand", 1:4),
            variable.name = "brand",
            value.name = "sale")

Updated melt following comment from @kaos1511

# handling brand names that are not on the form brand1, brand2, brandn"

# add some fake brand names to df
names(df) <- c("customerid", "date", "store", "Mazda", "Toyota", "Nissan", "Volvo")

    # If data for different brands always come after customerid, date and store
# you can melt data by specifying 'measure variables' by position, like this
# melt data
df2 <- melt(data = df,
            measure.vars = 4:(ncol(df)),
            variable.name = "brand",
            value.name = "sale")

# alternatively, you can specify customerid, date and store as 'id variables'
# melt will then assume that all remainding variables, i.e. all 'brand columns', are measure variables
df2 <- melt(data = df,
            id.vars = c("customerid", "date", "store"),
            variable.name = "brand",
            value.name = "sale")

# remove $ and replace -
df2$sale <- with(df2, gsub(pattern = "$", replacement = "", sale, fixed = TRUE))
df2$sale[df2$sale == "-"] <- 0

# convert to date 
df2$date <- as.Date(df2$date, format = "%d-%m-%Y")

# select rows with sale > 0
df3 <- df2[df2$sale > 0, ]


# Create new variables
# per brand and customerid, create counter and lagdate
# nb, in your last two 'expected output', lagdate does not match.
# my lagdate matches the first of them.
df4 <- ddply(.data = df3, .variables = .(brand, customerid), mutate,
             counter = as.numeric(as.factor(date)),
             lagdate = c(NA, as.character(head(date, -1))))

# order by brand, store and date
df4 <- arrange(df4, brand, store, date)

# per brand and store, calculate days between trips
df5 <- ddply(.data = df4, .variables = .(brand, store), mutate,
             daysbetweentrips = c(NA, diff(date)))

# order by brand, customerid and date
df5 <- arrange(df5, brand, customerid, date)

Henrik-i had quick questions:in my data, i wont have data with $, so the remove $ and replace step isnt needed. Variable names are not in brand1,2,3 order,,they are absolutely random like abc,xyz,def etc..thats why i wondered if i could put the colnames in a loop ,,please Let me know.. also...u have created a brand variable with values 1,2,3,4 i believe..and loaded all into one dataframe...but what i need is i need to run further regressions on the brand level datasets i obtain... and thats why i wanted to put the brand level data frames separately...rather than keep them in one df... — , Sep 02 '13 at 03:44
I think having a single brand variable will work if the # o brands is less.. I have 400 such columns and I think it might inflate the data frame size that gets created. I will need to run a regression with all new variables created.. Although I can run it with filter of brand =1 to n where n is index of last brand name .. Please suggest which is more efficient — , Sep 02 '13 at 05:13
@kaos1511, Next time, please post data that accurately represents your real data. If your data did not have '$' or '-", why post it? You create unnecessary work for people trying to help you. — Henrik, Sep 02 '13 at 08:47
You can easily run a regression on a long data frame grouped by one or more variables, using `ddply`, `data.table` or `base` R functions such as `by`. See e.g. [here](http://stackoverflow.com/questions/18025548/regression-by-subset-in-r). The topic is often discussed on SO, and you will must likely find nice examples that you can adapt to your own case. — Henrik, Sep 02 '13 at 08:56
Sorry about that Henrik, i thought using those might aid in understanding. Will ensure these dont recur from next time.. — , Sep 02 '13 at 09:21
@kaos1511. I have updated my answer to handle more 'irregular' brand names. — Henrik, Sep 02 '13 at 09:32

Creating new variables in grouped data

2 Answers2