I have data like the following (output of str()):

'data.frame':    1460 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

I would like to make a GLM to predict SalePrice from all of the other variables.

After I write this:

cena_nieruchomości.lm <- glm(SalePrice~.,
   data=nieruchimości,family=binomial(logit))

I am getting an error:

contrasts can be applied only to factors with 2 or more levels.

I have read that it might occur because of NA values in my data. So I tried:

cena_nieruchomości.lm <- glm(SalePrice~.,
  data=nieruchimości,family=binomial("logit"), na.action=na.pass)

Then I get the next error:

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Could someone please tell me what I'm doing wrong and how to avoid this error? Could it be because SalePrice is an int (should it be a factor)?

Ben Bolker
  • Hi, I posted a link which would likely be helpful in your previous question https://stackoverflow.com/questions/59522491/how-to-avoid-naa-in-r-regression. Did you work through the suggestions? If so, can you edit your question to say where you are still stuck please. – user20650 Dec 29 '19 at 21:01
  • Sorry Ben, I can't see that link. Could you post it here? – Firts_is_science Dec 30 '19 at 05:45
  • Why would you estimate a binomial model when your dependent variable is rather continuous? Consider that `glm.fit` treats `"SalePrice"` like logical, i.e. everything > 0 is `TRUE` and everything = 0 is `FALSE`. Since all values seem to be > 0 you get the `Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, `. You may want to set `family=gaussian()` or use `lm` as stated in @Kreuni's answer. – jay.sf Dec 30 '19 at 06:17
  • As I said, I am really new to this. I will try this and report how it worked :) Thanks in advance! :) – Firts_is_science Dec 30 '19 at 06:42
  • Unfortunately, when I use this code: price <- glm(SalePrice~., data=nieruchomości, family=gaussian, na.action=na.pass) I get the same error: Error in glm.fit... – Firts_is_science Dec 30 '19 at 18:32

1 Answer

SalePrice is an interval/continuous variable. family=binomial('logit') in your glm() call is for fitting logistic regression which assumes you have a dependent variable that only takes on two values.
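
As a rough illustration of that point, the sketch below fits a binomial GLM to a continuous response and captures the resulting error. It uses the built-in mtcars data as a stand-in for the housing data (an assumption; mpg plays the role of SalePrice), and the name fit_attempt is made up for this example.

```r
# Toy illustration on the built-in mtcars data (not the asker's data set).
# mpg is continuous and far above 1, much like SalePrice.
fit_attempt <- tryCatch(
  glm(mpg ~ wt, data = mtcars, family = binomial("logit")),
  error = function(e) e
)

# The binomial family rejects a numeric response outside [0, 1],
# so the call fails instead of returning a fitted model.
inherits(fit_attempt, "error")
conditionMessage(fit_attempt)
```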

Given your dependent variable, logistic regression is not the right choice. You would do better to estimate a linear model with lm():

cena_nieruchomości.lm <- lm(SalePrice~.,
   data=nieruchimości)
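
If you prefer to stay with glm(), a Gaussian family should give the same coefficient estimates as lm(). The following is a minimal sketch of that equivalence, assuming the built-in mtcars data as a stand-in for your data frame; fit_lm and fit_glm are names invented for the example.

```r
# Built-in mtcars stands in for the asker's data here (an assumption).
fit_lm  <- lm(mpg ~ ., data = mtcars)
fit_glm <- glm(mpg ~ ., data = mtcars, family = gaussian())

# Both calls estimate the same coefficients, up to numerical tolerance.
same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm), tolerance = 1e-6))
same_coefs
```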
Kreuni
  • Also, there probably isn't enough data to estimate so many parameters. 80 predictors plus an intercept, at roughly 20 observations per parameter, would call for 1600+ observations, and the OP has only 1460. – Rui Barradas Dec 29 '19 at 22:44
  • 20 observations per coefficient is not a hard and fast rule, since 10 or 15 per predictor often gives meaningful results, but mindlessly throwing data at modeling functions with no theory and no investigation of the relationships among predictors is definitely horrible statistical practice. – IRTFM Dec 30 '19 at 00:33
  • I know it is a lot of variables, but the plan is to reduce their number with forward stepwise selection. I can't do this before I use glm properly. So, should I delete "family=binomial(logit)", use lm instead of glm, and it should work? – Firts_is_science Dec 30 '19 at 05:40
  • Yes if you use `lm()` it will work. You can also delete the family argument, at which point the default is to estimate the same thing as `lm()` would. That said, you might want to do some more stats reading before going any further. – Kreuni Dec 30 '19 at 19:23
  • (1/2) I'm going to chime in here: (1) switching to `lm()` for a continuous response will be necessary, but not sufficient. Your original error message refers to the fact that, once responses with missing values for some predictors are discarded, some of your categorical variables have only a single level remaining: see [this canonical question](https://stackoverflow.com/questions/44200195/how-to-debug-contrasts-can-be-applied-only-to-factors-with-2-or-more-levels-er). – Ben Bolker Dec 30 '19 at 19:46
  • (2/2) (2) Using na.action=na.pass will not help: it just passes the missing value through to the fitting function, which will then break. You either need to drop categorical predictors with not enough levels; drop predictors with lots of missing values (which lead to lots of observations being discarded); do some kind of imputation of the missing values; or switch to a method like random forests that's more robust to missing values. (3) stepwise regression/feature selection is dominated by lots of other methods, such as penalized (ridge/lasso) regression ... – Ben Bolker Dec 30 '19 at 19:48
  • Thank you very much, Ben. To make sure I understand, I should: 1. somehow deal with predictors that have only a single level (but I can't see one in my "str(data)" output); 2. drop categorical predictors with lots of missing values; 3. do some kind of imputation of the missing values. So it can't be done with any NAs? It is my university project and we have to use linear regression :) – Firts_is_science Dec 30 '19 at 20:33
  • Please read the answers to [this question](https://stackoverflow.com/questions/44200195/how-to-debug-contrasts-can-be-applied-only-to-factors-with-2-or-more-levels-er) carefully: you will probably have predictors with a single level **after** NA values are dropped. Try `rdata <- mydata[complete.cases(mydata),]` and then look. Linear regression can't be done with NAs; they need to be dropped **or** imputed. – Ben Bolker Dec 30 '19 at 21:36
  • Ok, I will try to follow this step and inform how it worked. Thanks a lot. – Firts_is_science Dec 31 '19 at 12:33
  • Hi in 2020! I have dealt with the NAs in my data using kNN (VIM), and finally the model runs without any error! model <- lm(SalePrice ~ ., data = nieruchomości). – Firts_is_science Jan 01 '20 at 17:59
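
The diagnosis suggested in the comments (drop incomplete rows, then check how many levels each factor has left) could be sketched roughly as follows. The toy data frame and the names toy, complete, n_levels and bad are hypothetical and only stand in for the housing data; dropping the offending column is just one of the options mentioned above, alongside imputation.

```r
# Hypothetical toy data standing in for the housing data frame.
toy <- data.frame(
  price = c(10, 12, 15, 11, 14, 13),
  zone  = factor(c("A", "B", "A", "B", "A", "B")),
  alley = factor(c("Grvl", "Pave", "Grvl", NA, "Grvl", "Grvl")),
  area  = c(50, NA, 55, 60, 70, 65)
)

# Number of missing values per column.
na_counts <- colSums(is.na(toy))

# Keep only complete rows, which is what lm() does by default,
# and drop factor levels that no longer occur.
complete <- droplevels(toy[complete.cases(toy), ])

# Count the levels each factor still has; fewer than 2 triggers
# "contrasts can be applied only to factors with 2 or more levels".
is_fac   <- vapply(complete, is.factor, logical(1))
n_levels <- vapply(complete[is_fac], nlevels, integer(1))
bad      <- names(n_levels)[n_levels < 2]
bad

# One possible fix: leave the offending predictor(s) out of the model.
fit <- lm(price ~ ., data = toy[, setdiff(names(toy), bad)])
```

Here alley has two levels in the raw data, but its only "Pave" row is discarded because area is missing in that row, so a single level remains once incomplete rows are removed.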