
I'm trying to build a linear regression model using eight independent variables, but when I run lm() one variable--what I anticipate being my best predictor!--keeps returning NA. I'm still new to R, and I cannot find a solution.

Here are my independent variables:

  • TEMPERATURE
  • HUMIDITY
  • WIND_SPEED
  • VISIBILITY
  • DEW_POINT_TEMPERATURE
  • SOLAR_RADIATION
  • RAINFALL
  • SNOWFALL

My data frame is training_set, and it looks like this: [image: preview of training_set, not reproduced]

I'm not sure whether this matters, but training_set is 75% of my original df, and testing_set is the remaining 25%. I created them as follows:

set.seed(1234)
split_bike_sharing <- sample(c(rep(0, round(0.75 * nrow(bike_sharing_df))), rep(1, round(0.25 * nrow(bike_sharing_df)))))

This gave me table(split_bike_sharing):

0 1
6349 2116

And then I did:

training_set <- bike_sharing_df[split_bike_sharing == 0, ]
testing_set <- bike_sharing_df[split_bike_sharing == 1, ] 

The structure of training_set is: [image: structure of training_set, not reproduced]

To create the model I run the code:

lm_model_weather <- lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY +
                         DEW_POINT_TEMPERATURE + SOLAR_RADIATION + RAINFALL + SNOWFALL,
                       data = training_set)

However, the resulting model reports the coefficient for RAINFALL as NA. Here is the model output:

[image: lm() model output, not reproduced]

My first thought was to check RAINFALL datatype, which is numeric with range 0-1 (because at an earlier step I performed min-max normalization). But SNOWFALL also is numeric, and I've done nothing (that I know of!) to the one but not the other. My second thought was to confirm that RAINFALL contains enough values to work, and that does not appear to be an issue: summary(training_set$RAINFALL):

[image: output of summary(training_set$RAINFALL), not reproduced]

So, how do I correct the NAs in RAINFALL? Truly I will be most grateful for your guidance to a solution.
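For readers without access to the screenshots, here is a minimal sketch of how lm() can report NA for a coefficient. It uses synthetic data only; x1, x2, x3, and y are made-up names and are not columns of training_set. When one predictor is an exact linear combination of the others, lm() cannot estimate it separately, and alias() reports the dependency.

```r
# Synthetic data; every name and value here is invented for illustration.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x1 - x2            # exact linear combination of x1 and x2
y  <- 1 + x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

coef(fit)                    # the coefficient for x3 is NA
dep <- alias(fit)$Complete   # one row per aliased term
dep                          # expresses x3 in terms of the other columns
```

Whether this is what happened to RAINFALL can only be confirmed on the real data; the sketch shows the mechanism, not the diagnosis.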

UPDATE 10 MARCH 2022: I've now checked for collinearity:

X <- model.matrix(RENTED_BIKE_COUNT ~ ., data = training_set)
X2 <- caret::findLinearCombos(X)
print(X2)

This gave me: [image: output of print(X2), not reproduced]

I believe this means that certain columns are linearly dependent on one another. The columns reported are 8, 13, and 38:

  • [8] is RAINFALL
  • [13] is SEASONS_WINTER
  • [38] is HOUR_23

Question: if I want to preserve RAINFALL as a predictor variable (viz., return proper values rather than NAs when I run lm()), what do I do? Remove columns [13] and [38] from the dataset?
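One way to explore this question is with a small synthetic example. The sketch below assumes a hypothetical exact dependency among three columns named rain, winter, and hour23; the real relationship in training_set is not known from the post. It illustrates that, within a linearly dependent set, lm() marks as NA whichever term comes last in the formula, and that dropping any one column of the set removes the singularity.

```r
# Deterministic toy data; names and the dependency itself are hypothetical.
winter <- rep(c(0, 1), times = 100)
hour23 <- rep(c(0, 0, 0, 1), times = 50)
rain   <- 0.5 * winter + 0.25 * hour23   # assumed exact linear dependency
set.seed(2)
y <- 3 + 2 * rain + rnorm(200)
d <- data.frame(y, rain, winter, hour23)

fit_last  <- lm(y ~ winter + hour23 + rain, data = d)  # rain listed last
fit_first <- lm(y ~ rain + winter + hour23, data = d)  # rain listed first
fit_drop  <- lm(y ~ rain + winter, data = d)           # hour23 removed

coef(fit_last)    # rain is NA
coef(fit_first)   # rain is estimated; hour23 is NA instead
coef(fit_drop)    # no NA coefficients
```

In this toy case, removing one of the other columns is enough; removing both is not required. The object returned by caret::findLinearCombos() also has a remove element suggesting column positions to drop, which may be a useful starting point on the real data.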

RKeithL
  • Your model summary tells you "`1 not defined because of singularities`". `RAINFALL` must be a linear combination of some of your other predictors, resulting in a singular matrix. – Gregor Thomas Mar 10 '22 at 04:13
  • Thank you for your response, Gregor. But I don't understand how `RAINFALL` could have become such a combination. It's a standalone column; at no point did I have to generate that column out of other columns' data. Forgive me if I'm asking a dumb question. – RKeithL Mar 10 '22 at 04:17
  • Your question isn't dumb. Check the correlation matrix of your data `cor(training_set)` to see if it's perfectly correlated with any one variable. There's an excellent answer on the subject [here](https://stats.stackexchange.com/a/70910/7515). – Gregor Thomas Mar 10 '22 at 04:23
  • https://stackoverflow.com/questions/71270248/error-warning-on-running-glm-in-r-coefficients-1-not-defined-because-of-sing/71270467#71270467 – Ben Bolker Mar 10 '22 at 04:25
  • @GregorThomas, `cor(training_set)` gives me an error: "Error in cor(training_set) : 'x' must be numeric". Would it still be productive to run cor() *only* on the numeric columns? Thanks for directing me to the explanation at that link (which I definitely needed to read twice!). – RKeithL Mar 10 '22 at 04:42
  • Thank you, @BenBolker, for directing me to the other question and your reply there. If the suspicion is that my `RAINFALL` is either completely or nearly collinear with another predictor variable, should I just try removing the other predictors in the lm() one at a time until `RAINFALL` *doesn't* return as NAs? Or perhaps I've misunderstood your linked answer. Thank you again. – RKeithL Mar 10 '22 at 04:47
  • Could you provide small part of your data so that other users can reproduce your error message? I can't open your images. It's weird that my internet provider is now blocking the website used to store your images. – Abdur Rohman Mar 10 '22 at 05:00
  • The code in the linked answer gives an explicit recipe for finding out which combinations of variables in your model are multicollinear ... – Ben Bolker Mar 10 '22 at 14:59
  • @BenBolker, I've updated my original question in light of reviewing the code in your linked answer. I appreciate your help on this. Two months ago I had never even heard of 'R'! – RKeithL Mar 10 '22 at 17:03
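Regarding the `'x' must be numeric` error mentioned in the comments: a minimal sketch of restricting cor() to the numeric columns follows. The data frame here is a toy stand-in with made-up columns, not training_set.

```r
# Toy data frame; column names and values are invented for illustration.
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),                       # perfectly correlated with a
  c = c(5, 3, 4, 1, 2),
  season = factor(c("w", "s", "w", "s", "w"))  # non-numeric; cor() would reject it
)

num_cols <- vapply(df, is.numeric, logical(1)) # TRUE for numeric columns only
cor_mat  <- cor(df[, num_cols])
round(cor_mat, 2)
```

A pairwise correlation matrix only reveals dependencies between two columns at a time. A dependency that involves three or more columns, such as the one findLinearCombos() reported in the update, can exist without any single pair being perfectly correlated.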

0 Answers