Multiple Regression - Error in model.frame.default variable lengths differ

Question

I'm trying to run a multiple regression with 3 independent variables and 3 dependent variables. The question is how water quality influences plankton abundance within and between 3 different locations (guzzlers). The water quality variables are pH, phosphates, and nitrates. The dependent (response) variables would be the plankton abundance at each of the 3 locations.

Here is my code:

model1 <- lm(cbind(Abundance[Guzzler.. == 1], Abundance[Guzzler.. == 2], 
                   Abundance[Guzzler.. == 3]) ~ Phospates + Nitrates + pH, 
             data=WQAbundancebyGuzzler)

And this is the error message I am getting:

Error in model.frame.default(formula = cbind(Abundance[Guzzler.. == 1],  : 
  variable lengths differ (found for 'Phospates')

I think it has to do with how my data is set up, but I'm not sure how to change it to get the model to run. What I'm trying to see is how these water quality variables affect abundance at the different locations and how those effects vary between locations. So it doesn't seem quite logical to fit several separate models, which was my only other thought.

Here is the output from dput(head(WQAbundancebyGuzzler)):

    structure(list(ï..Date = structure(c(2L, 4L, 1L, 3L, 5L, 2L), .Label = c("11/16/2018", 
"11/2/2018", "11/30/2018", "11/9/2018", "12/7/2018"), class = "factor"), 
    Guzzler.. = c(1L, 1L, 1L, 1L, 1L, 2L), Phospates = c(2L, 
    2L, 2L, 2L, 2L, 1L), Nitrates = c(0, 0.3, 0, 0.15, 0, 0), 
    pH = c(7.5, 8, 7.5, 7, 7, 8), Air.Temp..C. = c(20.8, 25.4, 
    20.9, 16.8, 19.4, 27.4), Relative.Humidity... = c(62L, 31L, 
    41L, 59L, 59L, 43L), DO2.Concentration..mg.L. = c(3.61, 4.48, 
    3.57, 5.65, 2.45, 5.86), Water.Temp..C. = c(14.1, 11.5, 11.8, 
    13.9, 11.1, 17.8), Abundance = c(98L, 43L, 65L, 55L, 54L, 
    29L)), .Names = c("ï..Date", "Guzzler..", "Phospates", "Nitrates", 
"pH", "Air.Temp..C.", "Relative.Humidity...", "DO2.Concentration..mg.L.", 
"Water.Temp..C.", "Abundance"), row.names = c(NA, 6L), class = "data.frame")

Welcome to _Stack Overflow_! To help you we need something to reproduce your issue. i.e. *working* code and example data. There are several [*ways to provide data*](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610), probably adding the output of `dput(WQAbundancebyGuzzler)` or `dput(head(WQAbundancebyGuzzler))` to your question is sufficient. Avoid adding code or alphanumeric output as images. Consider [*How to make a great reproducible example*](https://stackoverflow.com/help/mcve) and edit your question, thanks. — jay.sf, Mar 02 '19 at 07:56
@jay.sf thanks so much for the tip! I've gone ahead and added the output of dput(head(WQAbundancebyGuzzler)) — Victoria Assad, Mar 02 '19 at 08:05
You should instead be using `Guzzler..` as a factor on the RHS. — IRTFM, Mar 02 '19 at 08:37

score 2 · Answer 1 · answered Mar 02 '19 at 08:27

I think the problem here is more theoretical: You say that you have three dependent variables that you want to enter into a multiple linear regression. However, at least in classic linear regression, there can only be one dependent variable. There might be ways around this, but I think in your case one dependent variable works just fine: it's `Abundance`. You have sampled three different locations, and one way to account for this is to enter the location as a categorical independent variable. So I would propose the following model:

# Make sure that Guzzler is not treated as numeric.
# The column in the data is named `Guzzler..`; store the factor in a new column `Guzzler`.
WQAbundancebyGuzzler$Guzzler <- as.factor(WQAbundancebyGuzzler$Guzzler..)

# Model with 4 independent variables
model1 <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH, 
             data=WQAbundancebyGuzzler)

It's probably also wise to think about possible interactions here, especially between Guzzler and the other independent variables.
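As a rough sketch of what such interactions could look like, the example below compares an additive model with one where each water quality slope is allowed to differ between guzzlers. The data frame `toy` is simulated and purely hypothetical (it only borrows the column names from the question), so the fitted numbers mean nothing; the point is the formula syntax and the nested-model comparison.

```r
# Simulated stand-in for WQAbundancebyGuzzler (hypothetical values, not the real data)
set.seed(1)
toy <- data.frame(
  Guzzler   = factor(rep(1:3, each = 10)),      # location as a factor
  Phospates = sample(1:3, 30, replace = TRUE),
  Nitrates  = round(runif(30, 0, 0.3), 2),
  pH        = round(runif(30, 7, 8), 1)
)
toy$Abundance <- rpois(30, lambda = 50)

# Additive model: one common slope per water quality variable,
# plus a separate intercept shift for each guzzler
fit_add <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH, data = toy)

# Interaction model: Guzzler * (...) expands to all main effects plus
# Guzzler:Phospates, Guzzler:Nitrates and Guzzler:pH
fit_int <- lm(Abundance ~ Guzzler * (Phospates + Nitrates + pH), data = toy)

# F-test of the nested models: do the interaction terms improve the fit?
anova(fit_add, fit_int)
```

Note that the interaction model estimates twice as many coefficients (12 instead of 6), so it needs a reasonable number of observations per guzzler to be useful.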

jay.sf · Answer 2 · 2019-03-02T08:17:40.217

The reason for your error is that you subset only `Abundance` but not the other variables, so their lengths differ. You need to subset the whole data frame, e.g.

lm(Abundance ~ Phospates + Nitrates + pH, 
   data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(1, 2, 3), ])

With the given head(WQAbundancebyGuzzler), for example,

lm(Abundance ~ Phospates + Nitrates + pH, 
   data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(29, 43, 65), ])

results in

# Call:
#   lm(formula = Abundance ~ Phospates + Nitrates + pH, data = WQAbundancebyGuzzler
#   [WQAbundancebyGuzzler$Abundance %in% 
#       c(29, 43, 65), ])
# 
# Coefficients:
#   (Intercept)    Phospates     Nitrates           pH  
#         -7.00        36.00       -73.33           NA
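To make the length mismatch concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical; only the column names resemble the question's). Subsetting the response inside the formula leaves 4 values on the left-hand side and 12 on the right, which is what `model.frame` complains about. Subsetting rows of the whole data frame, or using the `subset` argument of `lm`, keeps all variables the same length.

```r
# Hypothetical toy data: 3 guzzlers with 4 samples each
toy <- data.frame(
  Guzzler   = rep(1:3, each = 4),
  Phospates = c(1, 2, 3, 2,  2, 1, 3, 1,  3, 2, 1, 2),
  Abundance = c(98, 43, 65, 55,  29, 40, 51, 33,  60, 47, 38, 52)
)

# The response is cut down to 4 values, the predictor still has 12
length(toy$Abundance[toy$Guzzler == 1])  # 4
length(toy$Phospates)                    # 12

# So this call fails with "variable lengths differ"
res <- try(lm(Abundance[Guzzler == 1] ~ Phospates, data = toy), silent = TRUE)
inherits(res, "try-error")               # TRUE

# Option 1: subset the rows of the whole data frame
fit_rows <- lm(Abundance ~ Phospates, data = toy[toy$Guzzler == 1, ])

# Option 2: let lm() do the row selection via its subset argument
fit_subset <- lm(Abundance ~ Phospates, data = toy, subset = Guzzler == 1)

# Both options fit the same model
all.equal(coef(fit_rows), coef(fit_subset))
```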
