I am trying to perform a linear regression with gradient descent (batch update) in R. I have created the following code using the Bike-Sharing-Dataset from the UCI Machine Learning Repository:

data <- read.csv("Bike-Sharing-Dataset/hour.csv")

# Select the useable features
data1 <- data[, c("season", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed", "cnt")]

# Examine the data structure
str(data1)

summary(data1)

# Linear regression
# Set seed
set.seed(100)

# Split the data
trainingObs<-sample(nrow(data1),0.70*nrow(data1),replace=FALSE)

# Create the training dataset
trainingDS<-data1[trainingObs,]

# Create the test dataset
testDS<-data1[-trainingObs,]

# Create the variables
y <- trainingDS$cnt
X <- as.matrix(trainingDS[-ncol(trainingDS)])

int <- rep(1, length(y))

# Add intercept column to X
X <- cbind(int, X)

# Solve for beta
betas <- solve(t(X) %*% X) %*% t(X) %*% y

# Round the beta values
betas <- round(betas, 2)

print(betas)

gradientR <- function(y, X, epsilon, eta, iters){
  epsilon = 0.0001
  X = as.matrix(data.frame(rep(1,length(y)),X))
  N = dim(X)[1]
  print("Initialize parameters...")
  theta.init = as.matrix(rnorm(n=dim(X)[2], mean=0,sd = 1)) # Initialize theta
  theta.init = t(theta.init)
  e = t(y) - theta.init%*%t(X)
  grad.init = -(2/N)%*%(e)%*%X
  theta = theta.init - eta*(1/N)*grad.init
  l2loss = c()
  for(i in 1:iters){
    l2loss = c(l2loss,sqrt(sum((t(y) - theta%*%t(X))^2)))
    e = t(y) - theta%*%t(X)
    grad = -(2/N)%*%e%*%X
    theta = theta - eta*(2/N)*grad
    if(sqrt(sum(grad^2)) <= epsilon){
      break
    }
  }
  print("Algorithm converged")
  print(paste("Final gradient norm is",sqrt(sum(grad^2))))
  values<-list("coef" = t(theta), "l2loss" = l2loss)
  return(values)
}

gradientR(y, X, eta = 100, iters = 1000)

However, when I try to run this algorithm I get the following error:

[1] "Initialize parameters..."
Error in if (sqrt(sum(grad^2)) <= epsilon) { : missing value where TRUE/FALSE needed

I need help understanding this error and how to fix it. Also, is there a more efficient way to implement the algorithm without using any of R's standard packages and libraries?
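
One plausible reading of the error, assuming the data really contain no missing values: with `eta = 100` on unscaled features, each update overshoots, `theta` grows until it overflows to `Inf`, and a later update computes `Inf - Inf`, which is `NaN`. The gradient norm is then `NaN`, the comparison with `epsilon` yields `NA`, and `if()` stops. The sketch below reproduces that behaviour on made-up data; `X_demo`, `y_demo`, and `gd_norms` are hypothetical names and are not part of the code above.

```r
# Synthetic stand-in data (hypothetical; not the bike-sharing set)
set.seed(1)
n <- 200
X_demo <- cbind(1, matrix(runif(n * 2, 0, 20), ncol = 2))  # intercept + 2 unscaled features
y_demo <- as.vector(X_demo %*% c(5, 2, -1) + rnorm(n))

# Plain batch gradient descent that records the gradient norm at every step
gd_norms <- function(X, y, eta, iters) {
  theta <- rep(0, ncol(X))
  norms <- numeric(iters)
  for (i in seq_len(iters)) {
    grad <- -(2 / nrow(X)) * as.vector(crossprod(X, y - X %*% theta))
    theta <- theta - eta * grad
    norms[i] <- sqrt(sum(grad^2))
  }
  norms
}

# Large step size: the norms blow up and stop being finite
norms_big <- gd_norms(X_demo, y_demo, eta = 100, iters = 1000)
first_bad <- which(!is.finite(norms_big))[1]

# Small step size: the norms stay finite
norms_small <- gd_norms(X_demo, y_demo, eta = 1e-3, iters = 1000)

# Comparing NaN with a number gives NA, which if() rejects
msg <- tryCatch(if (NaN <= 1e-4) "stop", error = function(e) conditionMessage(e))
```

If `first_bad` is well below `iters`, the step size rather than the data is the likely culprit. Checking `is.finite()` on the gradient norm before the `if()` turns the cryptic error into an explicit one.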

zsad512
  • Your condition seems to result in NA: https://stackoverflow.com/questions/7355187/error-in-if-while-condition-missing-value-where-true-false-needed. Do you have any missing data in your dataset? – Harald Gliebe Sep 06 '17 at 15:13
  • @Harald, there is no missing data. I also looked at `is.na(x)`, which returns `FALSE` for all values as far as I can tell. – zsad512 Sep 06 '17 at 15:47
  • Can you create a [mcve](https://stackoverflow.com/help/mcve) with some dummy data? – and-bri Sep 06 '17 at 15:59
  • @and-bri My question is as minimal, complete, and verifiable as I can make it, because I'm not really sure which aspect of the code is throwing the error. That is why I have included the dataset that is being used in my question (I updated it to provide the location for download). This dataset has been transformed, and I'm not really sure whether the data structure or the function is causing the error. – zsad512 Sep 06 '17 at 16:21
  • We don't have the file you load in the first line, "Bike-Sharing-Dataset/hour.csv", which makes it hard to reproduce. Furthermore, there are functions like `summary()` and `str()` that have no influence on your result; they only make your code harder to understand. – and-bri Sep 06 '17 at 16:30
  • I suggest you give us the data for `y` and `X` from this line, `gradientR(y, X, eta = 100, iters = 1000)`, in a form like `y <- c(1:5)` or `X <- matrix(c(1:20), ncol=2)`. – and-bri Sep 06 '17 at 16:32
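
Along the lines suggested in the comments, here is a minimal base-R sketch of batch gradient descent on dummy data. It is not the asker's function: the name `gradient_descent_sketch`, the dummy data, and the defaults (`eta = 0.1`, `epsilon = 1e-6`) are assumptions chosen for illustration. It standardizes the features, adds the intercept column once, guards against a non-finite gradient, and reports whether the tolerance was actually reached.

```r
# Hypothetical dummy data standing in for y and X (not the bike-sharing set)
set.seed(42)
n <- 500
X_raw <- cbind(temp = runif(n), hr = sample(0:23, n, replace = TRUE))
y <- as.vector(10 + 30 * X_raw[, "temp"] + 2 * X_raw[, "hr"] + rnorm(n))

# Standardize features so one step size suits every column,
# then add the intercept column exactly once
X <- cbind(int = 1, scale(X_raw))

gradient_descent_sketch <- function(y, X, eta = 0.1, iters = 5000, epsilon = 1e-6) {
  N <- nrow(X)
  theta <- rep(0, ncol(X))
  names(theta) <- colnames(X)
  converged <- FALSE
  for (i in seq_len(iters)) {
    residual <- y - as.vector(X %*% theta)
    # Gradient of the mean squared error with respect to theta
    grad <- -(2 / N) * as.vector(crossprod(X, residual))
    grad_norm <- sqrt(sum(grad^2))
    if (!is.finite(grad_norm)) stop("gradient is not finite; try a smaller eta")
    if (grad_norm <= epsilon) {
      converged <- TRUE
      break
    }
    theta <- theta - eta * grad
  }
  list(coef = theta, iterations = i, converged = converged)
}

fit <- gradient_descent_sketch(y, X)

# Closed-form least squares on the same design matrix, for comparison
closed_form <- as.vector(solve(crossprod(X), crossprod(X, y)))
max_abs_diff <- max(abs(fit$coef - closed_form))
```

On efficiency: `crossprod(X, residual)` computes `t(X) %*% residual` without forming the transpose, and keeping `theta` as a plain vector avoids the repeated `t()` calls. The coefficients refer to the standardized features, so they are not directly comparable to betas fitted on the raw columns.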
