
I was wondering why `lm()` reports that 5 coefficients are "not defined because of singularities" and then shows all `NA` for those 5 coefficients in the `summary()` output.

Note that all my predictors are categorical.

Is there something wrong with my data or my code for these 5 coefficients? How can I fix this?

d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data

nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")

d[nms] <- lapply(d[nms], as.factor) # make factor

vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)

summary(vv) 

First 6 lines of output:

     Coefficients: (5 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.17835    0.63573   0.281 0.779330    
Age1          -0.04576    0.86803  -0.053 0.958010    
Age2           0.46431    0.87686   0.530 0.596990    
Age99         -1.64099    1.04830  -1.565 0.118949    
genre2         1.57015    0.55699   2.819 0.005263 ** 
genre4              NA         NA      NA       NA    ## For example, this row is all `NA`s; there are 4 more like it!

2 Answers


As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for the correlated variables, you should inspect your data to identify the collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific question. Alternatively, you could use regularisation and fit a model while retaining all variables; this also lets you fit a model when n (the number of samples) is less than p (the number of predictors).
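
As a quick diagnostic (an illustrative addition, assuming the `vv` model fitted above), base R's `alias()` shows exactly which terms are linear combinations of other terms, i.e. where the `NA` coefficients come from:

## The "Complete" section expresses each aliased (NA) coefficient
## as a linear combination of the remaining model terms
alias(vv)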

I've shown code below that demonstrates how to examine the correlation structure within your data and how to identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).

d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data

nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")

d[nms] <- lapply(d[nms], as.factor) # make factor

vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)


df <- d
## Convert every column (including factors) to numeric codes so that a
## correlation matrix can be computed
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")

library("gplots")
heatmap.2(cor_mat, trace = "none")

## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")

d2 <- df %>% 
  as.matrix() %>%
  cor(use = "complete.obs") %>%
  ## Set diag (a vs a) to NA, then remove
  (function(x) {
    diag(x) <- NA
    x
  }) %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(!is.na(value)) %>%
  ## Sort by decreasing absolute correlation
  arrange(-abs(value))

## 2 pairs of variables are almost exactly correlated!
head(d2)
#>         var1       var2     value
#> 1         id study.name 0.9999430
#> 2 study.name         id 0.9999430
#> 3   Location      timed 0.9994082
#> 4      timed   Location 0.9994082
#> 5        Age   ed.level 0.7425026
#> 6   ed.level        Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
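
## (Illustrative addition, not from the original answer.) caret's
## findCorrelation() suggests columns to drop so that no remaining pair
## exceeds a chosen absolute-correlation cutoff (0.9 here):
library("caret")
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
## Candidate variables to remove; in practice, exclude the response and
## ID columns before dropping anything
colnames(cor_mat)[drop_idx]
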
library("glmnet")

## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
## Drop rows that contain any missing predictor value
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
## Avoid naming the result `glmnet`, which would shadow the function
fit <- glmnet(
    x = X,
    y = y,
    ## alpha = 0 is ridge regression
    alpha = 0)

plot(fit)
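
If you do go the ridge route, the penalty strength lambda is usually chosen by cross-validation rather than by eye. A minimal sketch (an addition to the answer, reusing the `X` and `y` defined above):

## Pick lambda by 10-fold cross-validation over a grid of penalties
cv_fit <- cv.glmnet(x = X, y = y, alpha = 0)
cv_fit$lambda.min               # lambda with the lowest CV error
coef(cv_fit, s = "lambda.min")  # coefficients at that penalty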

Created on 2019-11-08 by the reprex package (v0.3.0)

alan ocallaghan
  • Thank you, but all my data is categorical! –  Nov 08 '19 at 16:41
  • What does your data being categorical matter? Linear regression works fine with categorical predictors. – alan ocallaghan Nov 08 '19 at 16:43
  • Thanks. I mean, in terms of using Pearson correlation on categorical data, what exactly is the interpretation of the current heatmap? –  Nov 08 '19 at 16:55
  • It is a heatmap of the correlation coefficients. Each cell represents the pairwise correlation coefficient between two variables in the dataset. Pearson correlation also works on categorical data when converted to numeric -- in fact, Spearman can have difficulties due to ties. – alan ocallaghan Nov 08 '19 at 16:57
  • Also, you mentioned there are some missing values in the data, but I can't find any. Could you please tell me where they are? –  Nov 08 '19 at 16:58
  • `d[89, "cf.scope"]` – alan ocallaghan Nov 08 '19 at 17:01
  • Awesome, is there a way to find all such missing values? –  Nov 08 '19 at 17:04
  • It depends slightly on the data type, but for a matrix or data.frame `which(apply(d, 2, function(x) any(is.na(x))))` will tell you which columns contain missing values. Cheers – alan ocallaghan Nov 08 '19 at 17:11

In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code for doing stepwise regression in R:

library("olsrr")

# Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)

# stepwise regression 
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)

summary(vv)  

k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)

# stepwise regression plot 
plot(k)

# final model 
k$model

It will give you exactly the same output as SPSS's stepwise procedure.

UseR10085
  • Can I instead remove the variables that are currently NA? For example, `genre4` gives `NA`, so is there a way to remove that specific level of `genre` from the data? –  Nov 08 '19 at 16:26
  • @Reza try the edit now, it is working fine after removing the highly correlated variables. – UseR10085 Nov 08 '19 at 17:05
  • `d[nms] <- lapply(d[nms], as.factor)` was creating the problem. Now with the raw data the stepwise regression model is working fine. – UseR10085 Nov 08 '19 at 17:08
  • I see what you mean, but how much correlation is allowed among predictors? Is there a cut point? –  Nov 08 '19 at 17:15
  • A rule of thumb is |correlation| > 0.7, though it may vary according to the dataset. – UseR10085 Nov 08 '19 at 17:21