
I have a dataset that I am running a logistic regression on in R. I have several questions about the output, which may stem from my lack of understanding of statistics and R, and then a question about reducing the model based on the p-values in the current output.

The command I was given is:

model = glm(col1 ~ 1+(col2+col3+col4+col5+col6+col7)^2, family=binomial, data=ds)
summary(model)
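
For reference, the ^2 in the formula expands to all main effects plus every pairwise interaction between them; terms() can confirm the expansion (shown here on a shorter formula for brevity):

# not part of the original command -- just shows how ^2 expands
f <- col1 ~ (col2 + col3 + col4)^2
attr(terms(f), "term.labels")
# "col2" "col3" "col4" "col2:col3" "col2:col4" "col3:col4"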

The columns contain the following data:

col1 has values of 0 and 1
col2 is an integer
col3 is an integer
col4 is an integer
col5 has values of horrible, bad, good, excellent
col6 has values of a, b, c
col7 has values of true and false

A segment of the coefficient and p-value output:

col1                0.2824
col2                0.3457
col3                0.7845
col4                0.1451
col5horrible        0.0541*
col5bad             0.5641
col5excellent       0.2354
col6a               0.0025**
col6b               0.6245
col7TRUE            0.4145
col1:col2           0.0124*
col1:col3           0.8401
col1:col4           0.3154
col1:col5horrible   0.0054**
col1:col5bad        0.2149
col1:col5excellent  0.0035**
col1:col6a          0.2487
col1:col6b          0.0354*
col1:col7TRUE       0.5647

The first thing I noticed was that for col5, col6, and col7 the output wasn't just the column name, as it was for col1, col2, col3, and col4, but the column name with a value appended (e.g. col5horrible). The second thing I noticed was that where values were being appended, not every possible value appeared: col5, col6, and col7 were each missing one value.

Questions for understanding

  1. What is ~1 in the glm function? I haven't seen 1 used in a formula before, so I'm not sure how to read it.
  2. Why are column values appended to the column names in the summary output?
  3. Where values are appended to a column name, why do not all of the possible values appear?

Code question

I want to reduce the model to see if it can be better fitted. The suggestion was to remove predictors from the current model whose p-value is above a certain threshold. Here is what I have so far, but I am not sure what to do next once I have the coefficient names, or how to put names like col5horrible (column name plus value) back into a model.

# column 4 of the coefficient table holds the p-values
p <- coef(summary(model))[, 4]
# named "keep" rather than "colnames", which would shadow the base function
keep <- names(p[p < 0.1])
keep

keep output

"col5horrible"
"col6a"
"col1:col5horrible"
"col1:col5excellent"
"col1:col6b"

What would be my next step, or is there a better way to do this? How do I handle the fact that the value is appended to the column name?
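
(For reference, the names printed above are coefficient names, where the factor level is appended to the column name; a formula, by contrast, is built from term labels, which have no level appended. The two can be compared directly:)

names(coef(model))                 # per-level coefficient names, e.g. "col5horrible"
attr(terms(model), "term.labels")  # term names that can go into a formula, e.g. "col5"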

EDIT

Based on the answer posted by schalange below, I looked up dummy variables in R. On this post there were several methods for creating dummy variables. For the non-numeric columns col5, col6, and col7, which all have a predefined set of values, I ran the function createDummyFeatures, then ran glm on the columns that came out of the original model with p-value < 0.1. Is this the correct approach to reducing the original model based on the p-values of the coefficients?

install.packages("mlr")
library(mlr)

# expand each non-numeric column into one 0/1 dummy column per level
ds <- createDummyFeatures(ds, cols = "col5")
ds <- createDummyFeatures(ds, cols = "col6")
ds <- createDummyFeatures(ds, cols = "col7")

model2 <- glm(col1 ~ col5.horrible + col6.a + col5.excellent + col6.b,
              family = binomial, data = ds)
summary(model2)
1 Answer

First of all, before estimating a model, you should be sure about what the variables really mean. R tries to estimate coefficients for your variables, which is hard to do when the "value" of a column is "horrible". So what R does (and it is a sensible thing to do) is treat those columns, col5, col6, and col7, as factors, i.e. as dummy variables. (You might want to google that; you'll find plenty of information.) The basic idea is that, for example, true and false are different groups of data, and R estimates something corresponding to an intercept for every group you mentioned. However, this can only be done for k-1 of the k categories of each column, which is why you only have a value for TRUE but not for FALSE (google "dummy variable trap"). I can't be entirely sure because I don't have any detail about the nature of the groups, but in all but rare cases you cannot keep HORRIBLE without also keeping the other two, EXCELLENT and BAD. Think of them as groups or names, not as values.
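
A minimal self-contained sketch of this dummy coding (toy data, not the asker's columns): a factor with k levels turns into k-1 indicator columns, and the dropped level is absorbed into the intercept.

x <- factor(c("true", "false", "true", "false"))
model.matrix(~ x)
# gives an (Intercept) column plus a single "xtrue" column;
# "false" is the reference level and gets no column of its own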

One last thing: if you want a model that is better fitted (meaning that it explains more of the variance, i.e. has a higher R^2), reducing variables is not the way to go; it makes your model less precise. However, a reduced model can still be an improvement, because the predictors that stay below a certain p-value threshold are important determinants, and you can be relatively certain they have a true effect. The internet offers plenty of material to help you interpret p-values.
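
As a hedged sketch of what the follow-up comments ask about: one common way to prune a model while keeping a factor's levels together is to test and drop whole terms rather than individual dummy coefficients, e.g. with drop1() and update() (the term name below is illustrative, not taken from the asker's output):

drop1(model, test = "LRT")                         # likelihood-ratio test for each whole term
model_reduced <- update(model, . ~ . - col3:col4)  # drop one whole term
summary(model_reduced)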

  • Thanks for some additional terms to google; I don't know enough of the technical terms and wasn't getting useful searches. I will take a look at the dummy trap. I have been trying to find a good resource that explains logistic regression and the output in R in layman's terms, so anything is helpful. In terms of the dataset, the values HORRIBLE, BAD, GOOD, and EXCELLENT are rankings of a product, and the col6 values are the type of a product. That is helpful info about better fitting the model; my understanding was off. – soccergal_66 Mar 29 '17 at 11:19
  • For the assignment I still do need to do a reduced version to compare to the original, and the suggestion had been to reduce variables based on the p-values of the original model. Your explanation definitely makes sense: R has no idea how to interpret non-numerical data. Any suggestions on how I leverage only the coefficients that are under a certain p-value? – soccergal_66 Mar 29 '17 at 11:28