How do we run a linear regression with the given data?

Question

We have a large data set with 26 brands sold in 93 stores over 399 weeks. The brands are further divided into 556 sub-brands (e.g., brand = Colgate, with sub-brands such as Colgate premium white, Colgate extra, etc.). For each sub-brand we calculated a brand-share-weighted price at the weekly store level. Calculation: (move per ounce for each sub-brand, per store and week) DIVIDED BY (sum of move per ounce over all sub-brands belonging to the same brand, per store and week) * (log price per ounce for each sub-brand, per store and week)
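A minimal base R sketch of that calculation on toy data. The column names (`brand`, `descrip`, `store`, `week`, `move_ounce`, `price_ounce`), the sub-brand rows, and the second brand are hypothetical stand-ins, not the real columns of tooth4:

```r
# Toy data: every name and value here is made up for illustration
toy <- data.frame(
  brand       = c("Colgate", "Colgate", "OtherBrand", "OtherBrand"),
  descrip     = c("Colgate premium white", "Colgate extra",
                  "OtherBrand one", "OtherBrand two"),
  store       = 1,
  week        = 1,
  move_ounce  = c(30, 10, 5, 15),
  price_ounce = c(0.50, 0.40, 0.60, 0.30)
)

# Total move per ounce of each brand within one store and week
brand_total <- ave(toy$move_ounce, toy$brand, toy$store, toy$week, FUN = sum)

# Share of each sub-brand within its brand (sums to 1 per brand/store/week)
toy$share <- toy$move_ounce / brand_total

# Share-weighted log price per ounce, as in the calculation described above
toy$log_wei_price_ounce <- toy$share * log(toy$price_ounce)
```

`ave()` returns a vector of the same length as its input, so the group totals line up row by row with the sub-brand data without a merge.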

Everything worked! We created a data frame with all the detailed calculations (data = tooth4). Our final goal is to run a linear regression to estimate the influence of price on the move variable. The problem is that the sale variable (a dummy indicating whether there is a promotion in a specific week for a specific sub-brand in a specific store) is at the sub-brand level. We tried to run a regression at the sub-brand level (variable = descrip), but it doesn't work because the data set is too large.

lm(formula = logmove_ounce ~ log_wei_price_ounce + descrip - 1 * 
    (log_wei_price_ounce) + sale - 1, data = tooth4)

logmove_ounce = log of weekly subbrand based move on store level 
log_wei_price_ounce = weighted subbrand based price for each store for each week
sale-1 = fixed effect for promotion 
descrip-1 = fixed effect for subbrand

Does anyone have a solution for running the regression only at the brand level while still including the promotion variable? We got a hint that we could calculate a shared value of promotion for each brand on each store. But how? Another question: assuming our regression is right, or partly right, how can we weight the results to get them at the store level only, rather than at the weekly store level?

Thank you in advance !!!

Welcome to SO. Please read [how to ask](https://stackoverflow.com/help/how-to-ask) and [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Currently it is not easy to help you out, without error messages, a code example or anything of the sorts. — Oliver, May 14 '19 at 17:59
This seems like a conceptual stats question, not a programming question. I think it is off-topic for Stack Overflow, more appropriate at stats.stackexchange or datascience.stackexchange — Gregor Thomas, May 14 '19 at 18:00
What do you mean "Does anyone have a solution how to run a regression only on brand level but include the promotion variable?" It is ambiguous without data. — akash87, May 14 '19 at 18:29

score 0 · Answer 1 · answered May 14 '19 at 19:16

We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?

This is variously called a multilevel model, nested model, hierarchical model, mixed model, or random-effects model; these are all names for the same mathematical model. It is widely used to analyze the kind of longitudinal panel data you describe. A serious book on the subject is Gelman and Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models.

The most common approach in R is to use the lmer() function from the lme4 package. If you're using lme4 on uncomfortably large data, you should read their performance tips.

lmer() models accept a slightly different formula syntax, which I'll describe only briefly so that you can see how it can solve the problems you're having.

For example, let's assume we're modeling future salary as a function of the GPA and IQ of certain students. We know that students come from certain schools, so all students who go to the same school form a group, and schools are in turn grouped into counties and states. Furthermore, students graduate in different years, which may have an effect. This is a generic example, but I chose it because it shares many characteristics with your own longitudinal panel data.

We can use the generalized formula syntax to specify groups with a varying intercept:

lmer(salary ~ gpa + iq + (1|school), data=df)

A nested hierarchy of such groups:

lmer(salary ~ gpa + iq + (1|state/county/school), data=df)

Or group-varying slopes to capture changes over time:

lmer(salary ~ gpa + iq + (1 + year|school), data=df)
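To try the three formulas above, here is a small self-contained sketch on simulated data. It assumes the lme4 package is installed; the data frame layout, effect sizes, and group counts are invented purely for illustration:

```r
library(lme4)  # assumed installed: install.packages("lme4")

set.seed(1)
# Hypothetical layout: 2 states x 3 counties x 4 schools x 20 students
df <- expand.grid(student = 1:20, school = 1:4, county = 1:3, state = 1:2)
df$state  <- factor(paste0("S", df$state))
df$county <- factor(paste0(df$state, "_C", df$county))
df$school <- factor(paste0(df$county, "_Sch", df$school))
df$year   <- sample(0:4, nrow(df), replace = TRUE)
df$gpa    <- rnorm(nrow(df), mean = 3, sd = 0.5)
df$iq     <- rnorm(nrow(df), mean = 100, sd = 15)

# Each school gets its own intercept shift and its own year slope
idx          <- as.integer(df$school)
school_int   <- rnorm(nlevels(df$school), 0, 5)
school_slope <- rnorm(nlevels(df$school), 0, 1)
df$salary <- 20 + 8 * df$gpa + 0.2 * df$iq +
  school_int[idx] + school_slope[idx] * df$year + rnorm(nrow(df), 0, 4)

# Varying intercept per school
m1 <- lmer(salary ~ gpa + iq + (1 | school), data = df)

# Nested hierarchy: schools within counties within states
m2 <- lmer(salary ~ gpa + iq + (1 | state/county/school), data = df)

# Varying intercept and varying year slope per school
m3 <- lmer(salary ~ gpa + iq + (1 + year | school), data = df)

fixef(m1)    # fixed-effect estimates for the intercept, gpa and iq
VarCorr(m3)  # estimated variances of the school intercepts and year slopes
```

With only two states in the simulated data, the nested model may report a singular fit; that is a message about the toy data, not an error in the syntax.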

You'll have to make your own decisions about how to model your data, but lme4::lmer() will give you a larger toolbox than lm() for dealing with groups and levels. I'd recommend asking on https://stats.stackexchange.com/ if you have questions about the modeling side.
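If you would rather stay with lm() and follow the hint from the question literally, one possible reading of a "shared value of promotion" is to weight the sale dummy by the same sub-brand shares used for the price and sum it up per brand, store and week. The sketch below is an assumption about what the hint means, run on simulated data with hypothetical column names, not on tooth4:

```r
set.seed(42)
# Simulated stand-in: 2 brands x 3 sub-brands each, 4 stores, 30 weeks
sub <- expand.grid(week = 1:30, store = 1:4, descrip = paste0("sub", 1:6))
sub$brand <- ifelse(sub$descrip %in% c("sub1", "sub2", "sub3"), "A", "B")
sub$sale  <- rbinom(nrow(sub), 1, 0.2)  # promotion dummy per sub-brand
sub$price_ounce <- runif(nrow(sub), 0.3, 0.8) * ifelse(sub$sale == 1, 0.8, 1)
sub$move_ounce  <- exp(3 - 1.5 * log(sub$price_ounce) + 0.4 * sub$sale +
                       rnorm(nrow(sub), 0, 0.3))

# Sub-brand share within its brand, per store and week
sub$share <- sub$move_ounce /
  ave(sub$move_ounce, sub$brand, sub$store, sub$week, FUN = sum)

# Share-weighted log price and share-weighted promotion
sub$w_price <- sub$share * log(sub$price_ounce)
sub$w_sale  <- sub$share * sub$sale

# Collapse to one row per brand, store and week
brand_lvl <- aggregate(cbind(move_ounce, w_price, w_sale) ~ brand + store + week,
                       data = sub, FUN = sum)

# Brand-level regression with a brand fixed effect
fit <- lm(log(move_ounce) ~ w_price + w_sale + brand, data = brand_lvl)
summary(fit)
```

Here `w_sale` lies between 0 and 1 and can be read as the share of a brand's volume that was on promotion in that store and week. Because the weights are built from the outcome variable itself, the coefficients should be interpreted with some caution.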
