R - Distance calculation looping through data frame rows on a time series

Question

I just started with R and have not been able to find a fix; I have read multiple answers but none were suitable. I am trying to use correlation as a distance measure between a bunch of stores, so as to come up with trial-control pairings and then assess whether a marketing campaign had a significant influence on post-campaign sales.

Total sales before the marketing campaign is the metric of interest. I have seven months' worth of it for each store and would like to loop through all stores to find the most suitable trial-control pairing for each month. Three stores were the subject of the marketing campaign (the trial stores), which also ran for three months; hence the need to find a good trial-control store match for each month.

Here is what I have come up with so far, which seems to be working. However, I have yet to understand how to store the results in a handy format that I can then use to find the highest trial-control store correlation for each month:

my.fun <- function(trial){
for (store in st.vector) {
  trial <- stores_stats_pre %>% filter(store_nbr == trial) %>% select(total_sales)
  control <- stores_stats_pre %>% filter(store_nbr == store) %>% select(total_sales)
  cor(control$total_sales, trial$total_sales)
}
}

and I would then simply use it as my.fun(trial_store_number)

st.vector contains stores' unique IDs (trial stores were removed to avoid calculating correlation with themselves)

trial_stores <- c(77, 86, 88)
st.vector <- unique(stores_stats_pre$store_nbr)
st.vector <- st.vector[!st.vector %in% trial_stores]

stores_stats_pre is a data frame containing a bunch of pre-campaign metrics for a total of 260 stores (I included only the first two):

stores_stats_pre <- data.frame(
    store_nbr=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
    year_month=c('2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12', '2019-01','2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12', '2019-01'),
    total_sales=c(206, 176, 278, 188, 192, 189, 154, 150, 193, 155, 168, 163, 136, 159))

I tried creating an empty data frame outside the loop, but I cannot work out how to append the correlation and the related control store number to it. Ideally, it would look something like this:

results_dataframe <- data.frame(
    Control_nbr = c(1,2,3, etc.),
    Correlation = c(correlation_vs_trial_store)
)

And I would modify my code like this:

results_dataframe <- data.frame(Control_nbr = integer(0), Correlation = integer(0))
my.fun <- function(trial){
for (store in st.vector) {
  trial <- stores_stats_pre %>% filter(store_nbr == trial) %>% select(total_sales)
  control <- stores_stats_pre %>% filter(store_nbr == store) %>% select(total_sales)
  correlation <- cor(control$total_sales, trial$total_sales)
  results_dataframe[Control_nbr] <- store
  results_dataframe[Correlation] <- correlation
}
}

But it doesn't work and I also get an "Error in cor(control$total_sales, trial$total_sales) : incompatible dimensions" message.

Also, I read that growing objects inside loops is bad practice, so I am not sure how I should go about it.

Thanks

Can you provide a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — william3031, Sep 04 '21 at 10:38
`"Error in cor(control$total_sales, trial$total_sales) : incompatible dimensions"` - let's say that you have ten stores with 7 dates. If you pick four for your trial, that is 28 rows (4x7), but your control has 42 rows (6x7). — william3031, Sep 08 '21 at 00:06

score 0 · Accepted Answer · answered Sep 08 '21 at 00:07

Is this what you are after? I created new test data because yours did not have enough rows to work with and did not include the trial stores.

This only works if the number of stores in the trial group equals the number in the control group, because cor() needs two vectors of the same length (as noted in my comment on your question).
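
The equal-size requirement comes from cor() itself. Here is a minimal base R sketch of the mismatch, using made-up numbers rather than the question's data (trial_sales and control_sales are hypothetical names):

```r
# Illustrative values only: 4 trial stores x 7 months vs 6 control stores x 7 months
set.seed(1)
trial_sales   <- rnorm(4 * 7)  # 28 values
control_sales <- rnorm(6 * 7)  # 42 values

# cor() refuses vectors of different lengths; capture the error message
msg <- tryCatch(
  cor(trial_sales, control_sales),
  error = function(e) conditionMessage(e)
)
print(msg)

# With equal lengths (5 vs 5 stores) the call succeeds and returns one number
equal_cor <- cor(rnorm(5 * 7), rnorm(5 * 7))
print(equal_cor)
```

The same error appears whenever either filtered vector comes back with a different number of rows, for example when a store number is missing from the data and the filter returns nothing.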

library(tidyverse)
# 10 stores x 7 months of simulated sales (no seed is set, so values differ per run)
stores_stats_pre <- data.frame(
  store_nbr = rep(1:10, each = 7),
  year_month = rep(c('2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12', '2019-01'), 10),
  total_sales = sample(100:200, 70))

trial_stores <- list(c(1:5), c(1, 2, 4, 8, 9), c(4:8)) %>% 
  set_names()

corr_function <- function(x){
  trial <- stores_stats_pre %>% 
    filter(store_nbr %in% x) %>% # stores in x
    pull(total_sales)
  
  control <- stores_stats_pre %>% 
    filter(!store_nbr %in% x) %>% # stores not in x
    pull(total_sales)
  
  cor(trial, control)
}

map_df(trial_stores, ~(corr_function(.x)), .id = "trial stores") %>% 
  pivot_longer(everything())


# A tibble: 3 x 2
  name               value
  <chr>              <dbl>
1 1:5              -0.219 
2 c(1, 2, 4, 8, 9) -0.133 
3 4:8              -0.0656
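
If the goal is instead one correlation per control store for a given trial store, as the question describes, a rough base R sketch could look like the following. It assumes every store has the same months in the same order. The helper name control_correlations and the simulated data are made up for illustration, and store 1 merely stands in for a trial store.

```r
set.seed(42)  # only so the simulated data is repeatable

# Simulated stand-in for stores_stats_pre: 10 stores x 7 months
stores_stats_pre <- data.frame(
  store_nbr   = rep(1:10, each = 7),
  year_month  = rep(c("2018-07", "2018-08", "2018-09", "2018-10",
                      "2018-11", "2018-12", "2019-01"), times = 10),
  total_sales = sample(100:200, 70)
)

# Hypothetical helper: correlate one trial store with each control store
control_correlations <- function(data, trial_store, control_stores) {
  # Sort so that the months line up between stores
  data <- data[order(data$store_nbr, data$year_month), ]
  trial_sales <- data$total_sales[data$store_nbr == trial_store]

  # vapply returns a numeric vector of known size, so nothing grows in a loop
  correlations <- vapply(control_stores, function(store) {
    cor(trial_sales, data$total_sales[data$store_nbr == store])
  }, numeric(1))

  result <- data.frame(Control_nbr = control_stores, Correlation = correlations)
  result[order(-result$Correlation), ]  # highest correlation first
}

trial_store    <- 1
control_stores <- setdiff(unique(stores_stats_pre$store_nbr), trial_store)
results <- control_correlations(stores_stats_pre, trial_store, control_stores)
print(head(results, 3))
```

Calling the helper once per trial store (for example with lapply over the trial store numbers) would give one such table per trial store, with the best candidate control store in the first row.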

Not exactly what I am looking for; however, since you are the only one who bothered answering, I'll accept it. Thanks — JWill, Sep 24 '21 at 07:49
