I have this dataset: https://www.kaggle.com/harlfoxem/housesalesprediction/version/1#kc_house_data.csv and I need to build a linear regression model. When I convert some features to factors, I get this error: `Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor grade has new levels 1`. I don't know what to do. I think I need to convert almost every feature I use to a factor, but I always get this error.

My code:

house.data.raw <- read.csv('housedata.csv')
library(ggplot2)
house.data.prepared <- house.data.raw

# Parse the date column (e.g. "20141013T000000") into Date objects
house.data.prepared$date <- as.Date(house.data.prepared$date, "%Y%m%dT000000")

# Remove every row that contains at least one NA
numberOfNA <- sum(is.na(house.data.prepared))
if (numberOfNA > 0) {
  cat('Number of missing values: ', numberOfNA)
  cat('\nRemoving missing values...')
  house.data.prepared <- house.data.prepared[complete.cases(house.data.prepared), ]
}

# house.data.final was not defined in the posted code; assuming it is the cleaned data
house.data.final <- house.data.prepared

# Convert selected columns to factors
house.data.final$bedrooms <- factor(house.data.final$bedrooms)
house.data.final$floors <- factor(house.data.final$floors)
house.data.final$waterfront <- factor(house.data.final$waterfront)
house.data.final$view <- factor(house.data.final$view)
house.data.final$condition <- factor(house.data.final$condition)
house.data.final$grade <- factor(house.data.final$grade)

library(caTools)
filter <- sample.split(house.data.final$bedrooms, SplitRatio = 0.7)

#Training set
house.train <- subset(house.data.final, filter == T)

#test set
house.test <- subset(house.data.final, filter == F)

dim(house.data.final)
dim(house.train)
dim(house.test)

model <- lm(price ~ ., data = house.train)
summary(model)

predict.train <- predict(model, house.train)
predict.test <- predict(model, house.test)
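
A note on where the error likely comes from: `sample.split` above balances the split on `bedrooms` only, so a rare `grade` level can end up in `house.test` without appearing in `house.train`, and `predict()` then has no coefficient for it. The sketch below reproduces this on a small made-up data frame (the `toy`, `train` and `test` names and all values are hypothetical, not the Kaggle data) and shows one possible workaround: predicting only the rows whose levels were seen during fitting.

```r
# Toy data (hypothetical): grade "1" occurs in a single row
toy <- data.frame(
  price = c(100, 120, 150, 180, 210, 260, 300, 310),
  grade = factor(c(1, 5, 5, 6, 6, 7, 7, 7))
)

train <- toy[2:7, ]      # grades 5, 6 and 7 only
test  <- toy[c(1, 8), ]  # contains the single grade-1 row

fit <- lm(price ~ grade, data = train)

# predict() stops because grade "1" was never seen while fitting;
# the error message is captured instead of aborting the script
res <- tryCatch(predict(fit, test), error = function(e) conditionMessage(e))
print(res)

# Possible workaround: keep only test rows whose level is known to the model
seen <- test$grade %in% fit$xlevels$grade
pred <- predict(fit, test[seen, , drop = FALSE])
print(pred)
```

Dropping rows is only one option. Keeping `grade` numeric, or merging rare levels before splitting, would avoid the problem as well; which is appropriate depends on the data.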
  • To make your question reproducible and thus answerable, we need minimal, self-contained code and data so that we can reproduce your problem on our machine. Please follow these simple guidelines: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610. – jay.sf May 23 '20 at 09:13
  • @jay.sf OK, done. I hope it isn't too much code. – Yshai Siboni May 23 '20 at 09:19
  • Why are you converting everything to factors? I looked at your data on Kaggle. `bedrooms` appears to be the number of bedrooms. That's not necessarily a factor. – Martin Gal May 23 '20 at 09:26
  • @MartinGal I did it because when I didn't convert them to factors I got this model output: ```Min 1Q Median 3Q Max -1248470 -120890 -15475 96015 4548560 ``` I don't think it's supposed to look like that, unless I'm wrong. – Yshai Siboni May 23 '20 at 09:54
  • That's not a model output. That's a summary of your data. You don't need to factor everything. It doesn't make sense. – StupidWolf May 23 '20 at 16:36
  • @StupidWolf I know that I don't need to factor everything, but my data summary doesn't make any sense to me. Shouldn't the median be 0 or close to 0? – Yshai Siboni May 23 '20 at 18:49
  • No, it's not centered. Why should it be zero? – StupidWolf May 23 '20 at 21:26
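
On the median question: for a least-squares fit with an intercept, the residuals average to zero, but their median does not have to be zero when the errors are skewed. The five numbers quoted above look like the Residuals block that `summary()` prints for an `lm` fit, though that is an assumption. A minimal sketch on simulated data (all values made up):

```r
set.seed(42)

# Simulated data with right-skewed (exponential) noise
x <- runif(200)
y <- 1 + 2 * x + rexp(200)

fit <- lm(y ~ x)

mean_res   <- mean(residuals(fit))    # zero up to rounding error
median_res <- median(residuals(fit))  # generally not zero for skewed noise

print(summary(residuals(fit)))
```

A nonzero median on its own therefore does not indicate that the model was fitted incorrectly.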

0 Answers