I'm trying to build a linear regression model using eight independent variables, but when I run lm(), the coefficient for one variable (the one I expect to be my best predictor!) keeps coming back as NA. I'm still new to R, and I cannot find a solution.
Here are my independent variables:
- TEMPERATURE
- HUMIDITY
- WIND_SPEED
- VISIBILITY
- DEW_POINT_TEMPERATURE
- SOLAR_RADIATION
- RAINFALL
- SNOWFALL
My data frame is training_set and looks like:
I'm not sure whether this matters, but training_set is 75% of my original data frame and testing_set is the remaining 25%. I created them like this:
set.seed(1234)
split_bike_sharing <- sample(c(rep(0, round(0.75 * nrow(bike_sharing_df))), rep(1, round(0.25 * nrow(bike_sharing_df)))))
This gave me table(split_bike_sharing):
0 | 1 |
---|---|
6349 | 2116 |
And then I did:
training_set <- bike_sharing_df[split_bike_sharing == 0, ]
testing_set <- bike_sharing_df[split_bike_sharing == 1, ]
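For comparison, the same kind of 75/25 split can be sketched by sampling row indices directly. This is only a minimal sketch on made-up data: toy_df and its columns are hypothetical stand-ins for bike_sharing_df, and it is not the code that produced the counts above.

```r
# Hypothetical stand-in for bike_sharing_df; the real data are not shown here.
set.seed(1234)
toy_df <- data.frame(id = 1:100, y = rnorm(100))

# Sample 75% of the row indices for training; the rest go to testing.
n_train   <- round(0.75 * nrow(toy_df))
train_idx <- sample.int(nrow(toy_df), size = n_train)

toy_training_set <- toy_df[train_idx, ]
toy_testing_set  <- toy_df[-train_idx, ]

# Every row lands in exactly one of the two sets.
nrow(toy_training_set) + nrow(toy_testing_set) == nrow(toy_df)
```

Sampling indices gives the same kind of random partition without building the intermediate 0/1 vector.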
The structure of training_set is like:
To create the model I run the code:
lm_model_weather <- lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE +
    SOLAR_RADIATION + RAINFALL + SNOWFALL, data = training_set)
However, the resultant model reports the coefficient for RAINFALL as NA. Here is the resultant model:
My first thought was to check the data type of RAINFALL, which is numeric with range 0-1 (because at an earlier step I performed min-max normalization). But SNOWFALL is also numeric, and I've done nothing (that I know of!) to one that I haven't done to the other. My second thought was to confirm that RAINFALL contains enough values to work with, and that does not appear to be an issue. Here is summary(training_set$RAINFALL):
So, how do I fix the NA coefficient for RAINFALL? I will be most grateful for any guidance toward a solution.
UPDATE 10 MARCH 2022. I've now checked for collinearity:
X <- model.matrix(RENTED_BIKE_COUNT ~ ., data = training_set)
X2 <- caret::findLinearCombos(X)
print(X2)
I believe this means certain columns are linearly dependent (jointly collinear). The columns involved are 8, 13, and 38:
- [8] is RAINFALL
- [13] is SEASONS_WINTER
- [38] is HOUR_23
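Note that X above is built from RENTED_BIKE_COUNT ~ ., so it may include columns that are not in the eight-predictor formula. A way to check the fitted model itself is base R's alias(), which lists terms that are linear combinations of earlier ones. The following is a minimal sketch on made-up data, not on training_set: toy, x1, x2, and x3 are hypothetical names, and x3 is deliberately constructed as an exact sum so that lm() has to return NA for it.

```r
# Toy data: x3 is an exact linear combination of x1 and x2 (by construction).
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$x3 <- toy$x1 + toy$x2
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2 + x3, data = toy)
coef(fit)            # the x3 coefficient is NA

# alias() reports which terms are linear combinations of earlier ones.
aliased <- alias(fit)$Complete
print(aliased)

# Rank of the design matrix versus its number of columns.
X_toy <- model.matrix(y ~ x1 + x2 + x3, data = toy)
qr(X_toy)$rank       # one less than ncol(X_toy) here
ncol(X_toy)
```

Running the analogous calls with the real formula and data = training_set would presumably show which terms RAINFALL depends on within that specific model, though that is an assumption until it is tried on the actual data.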
Question: if I want to keep RAINFALL as a predictor variable (that is, get an estimated coefficient rather than NA when I run lm()), what do I do? Remove columns [13] and [38] from the dataset?