Edited to show code I already have...

My job is to predict how much money a movie will make over its first 15 weeks on cable platforms. I do this by using a regression at each week during the first 14 weeks. But I need to automate the steps of calculating each regression:

  1. Subset the total data set by week (14 weeks total), giving 14 distinct data frames.

    df.names = paste("data", 1:14, sep = "")
    for (i in 1:14) {
      d.frame = subset(myData, Week == i)
      assign(df.names[i], d.frame)
    }

  2. Subset each week's data frames into training and test sets.

    set.seed(101)
    train_idx = sample(1:nrow(data1), round(0.7 * nrow(data1)), replace = FALSE)
    data1_train = data1[train_idx, ]
    data1_test = data1[-train_idx, ]

  3. Run a linear regression on the training set for each week.

    Week1_Regress = lm(x ~ coef1 + coef2 + ... + coefi, data = data1_train)

  4. Extract the coefficients for each regression into a CSV file.

    write.csv(Week1_Regress$coef,"Selected Folder")

  5. Calculate the RMSE using the test set and extract that into a CSV.

    test = predict(Week1_Regress, data1_test)
    rmse = function(test, obs) {
      sqrt(sum((obs - test)^2) / length(test))
    }

I can do each step individually, but I am looking for a loop or lapply solution so that I don't have to type out 14 versions of the 5 steps.

  • The easiest solution would be to write down your steps for one version, then wrap everything in a loop that iterates over your data set. For instance, let's say you have a tabular data set, with variables in columns, and weeks (or a sub-weekly interval) in rows. Then your loop would just increment a counter that points to the current week. It would repeat your 5 steps 14 times, or as many times as you want. Also, you should include in your question what you have tried. In this case, that would be to at least include the steps (actual code) for one week. – iled Sep 23 '16 at 01:22
  • Pardon my unsolicited advice. Seems to me that a TV executive could benefit from the comparative advantage of hiring a data scientist. – shayaa Sep 23 '16 at 02:08
  • the idea of lapply is to get some input, in your case a `data.frame` and apply a function to it, `lapply(df, some_function)`. Start writing your function and ask for help, so far this isn't a real question and will be closed, downvoted. – marbel Sep 23 '16 at 02:33
  • Thanks for taking the time to look! @shayaa - tell me about it, I'm working on that! – Adeeereeean Sep 23 '16 at 16:13
  • @marbel I can run the regressions using lapply but am not sure of the other steps... here is the code I have used but did not include above: `DataList = list(data1,data2,data3,data4 etc...)` I then used: `mod = lapply(DataList, function(y) lm(y ~ coefficients, data=y))`. After this I can get all of the coefficients together using `sapply(mod,coef)`. Apologies for not being specific in the beginning @iled – Adeeereeean Sep 23 '16 at 16:17
  • @JiTHiN - does this satisfy the requirement to remove the hold on this question? Thanks – Adeeereeean Sep 23 '16 at 21:31
  • If you include a [minimal, reproducible example -- usually complete with a dataset](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and `dput()` the output of said dataset you will often get very helpful SO replies. – shayaa Sep 23 '16 at 22:35
  • You might find the tidyverse packages useful in accomplishing this. Here's a post that covers something similar: https://drsimonj.svbtle.com/running-a-model-on-separate-groups – Simon Jackson Sep 26 '16 at 05:40

1 Answer

After trawling through SO, I have come up with the following solution which seems to work quite well:

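    df.names = paste("models", 1:14, sep = "")
    for (i in 1:14) {
      set.seed(101)  # same seed every week, so each split is reproducible
      data = subset(myData, Week == i)
      # 70% of the week's rows go into the training set
      data_select = sample(1:nrow(data), round(0.7 * nrow(data)), replace = FALSE)
      d.frame = data[data_select, ]
      model = lm(y ~ independent variables, data = d.frame)  # placeholder formula
      assign(df.names[i], coef(model))
    }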

    coefficients = data.frame(models1,models2,...,models14)
    write.csv(coefficients, "folder destination/filename.csv")
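
A variation on the loop above that avoids `assign()` is to `split()` the data by week and `lapply()` over the pieces, then collect the coefficients with `sapply(models, coef)` as mentioned in the comments. This is only a sketch: the data is simulated, and the column names `y`, `x1` and `x2` and the helper `fit_week` are placeholders made up for illustration, not the real variables.

```r
# Simulated stand-in for myData: 14 weeks, 50 rows per week.
# y, x1 and x2 are hypothetical column names.
set.seed(101)
myData <- data.frame(
  Week = rep(1:14, each = 50),
  x1   = rnorm(700),
  x2   = rnorm(700)
)
myData$y <- 2 + 3 * myData$x1 - myData$x2 + rnorm(700)

# One data frame per week, held in a named list instead of 14 variables.
by_week <- split(myData, myData$Week)

# Fit the regression on a 70% training sample of one week's data.
fit_week <- function(d) {
  train_idx <- sample(seq_len(nrow(d)), round(0.7 * nrow(d)))
  lm(y ~ x1 + x2, data = d[train_idx, ])
}
models <- lapply(by_week, fit_week)

# Matrix with one row per coefficient and one column per week.
coefficients <- sapply(models, coef)
write.csv(coefficients, file.path(tempdir(), "coefficients.csv"))
```

Keeping the models in a list means nothing has to be typed out 14 times, and the week number survives as the column name of the coefficient matrix.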

Then, to get the RMSEs for each of the 14 weeks, I replicate the same structure as above but use the test set, replacing the final `assign()` line in the loop with:

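      d.frame.test = data[-data_select, ]
      test = predict(model, d.frame.test)
      # RMSE: square the errors, average them, then take the square root
      rmse = sqrt(sum((d.frame.test$dependent_variable - test)^2) / nrow(d.frame.test))
      # df.names must be paste("rmse", 1:14, sep = "") here to match the names used below
      assign(df.names[i], rmse)
    }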

    errors = data.frame(rmse1,rmse2,...,rmse14)
    write.csv(errors, "folder destination/errors.csv")
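
The RMSE step can also be folded into a single function, so that each week's split, fit and test error come from one call. Again, this is a sketch on simulated data with made-up column names (`y`, `x1`, `x2`); `week_rmse` is a helper name chosen for illustration.

```r
# Root mean squared error: square the residuals, average, then take the root.
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

# Simulated stand-in for myData (hypothetical columns y, x1, x2).
set.seed(101)
myData <- data.frame(
  Week = rep(1:14, each = 50),
  x1   = rnorm(700),
  x2   = rnorm(700)
)
myData$y <- 2 + 3 * myData$x1 - myData$x2 + rnorm(700)

# Split one week 70/30, fit on the training rows, score on the held-out rows.
week_rmse <- function(d) {
  train_idx <- sample(seq_len(nrow(d)), round(0.7 * nrow(d)))
  model <- lm(y ~ x1 + x2, data = d[train_idx, ])
  test  <- d[-train_idx, ]
  rmse(predict(model, newdata = test), test$y)
}

# Named numeric vector with one RMSE per week.
errors <- vapply(split(myData, myData$Week), week_rmse, numeric(1))
write.csv(data.frame(Week = names(errors), rmse = errors),
          file.path(tempdir(), "errors.csv"), row.names = FALSE)
```

`vapply()` with `numeric(1)` fails loudly if any week returns something other than a single number, which is a cheap sanity check on the per-week function.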