I am trying to implement logistic regression. The code works when I run it line by line, but when I call the function I get the error "Error in nrow(X) : object 'X' not found", even though X is defined before the nrow call. I use the UCI "Adult" data set to test it.

If I run the body of the function manually, there is no error. Can anyone explain that?

# Sigmoid function
sigmoid <- function(z){
  g <- 1/(1+exp(-z))
  return(g)
}

# Cost function
cost <- function(theta){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}

log_reg <- function(datafr, m){

  # Train/test split
  sample <- sample(1:nrow(datafr), m)
  df_train <- datafr[sample,]
  df_test <- datafr[-sample,]

  num_features <- ncol(datafr) - 1
  num_label <- ncol(datafr)
  label_levels <- levels(datafr[, num_label])
  datafr[, num_features+1] <- ifelse(datafr[, num_label] == names(table(datafr[,num_label]))[1], 0, 1)

  # Predictor variables
  X <- as.matrix(df_train[, 1:num_features])
  X_test <- as.matrix(df_test[, 1:num_features])

  # Add ones to X
  X <- cbind(rep(1, nrow(X)), X)
  X_test <- cbind(rep(1, nrow(X_test)), X_test)

  # Response variable
  Y <- as.matrix(df_train[, num_label] )
  Y <- ifelse(Y == names(table(Y))[1], 0, 1)

  Y_test <- as.matrix(df_test[, num_label] )
  Y_test <- ifelse(Y_test == names(table(Y_test))[1], 0, 1)


  # Initial theta
  initial_theta <- rep(0, ncol(X))

  # Estimate theta with optim (its default method is Nelder-Mead, not gradient descent)
  theta_optim <- optim(par=initial_theta, fn=cost)

  predictions <- ifelse(sigmoid(X_test%*%theta_optim$par)>=0.5, 1, 0)


  # Generalization error
  error_rate <- sum(predictions!=Y_test)/length(Y_test)

  return(error_rate)
}

### Adult Data
data <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                    sep = ',', fill = FALSE, strip.white = TRUE)
colnames(data) <- c('age', 'workclass', 'fnlwgt', 'education', 
                    'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 
                    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')

# Feature selection
datafr <- data[, c("age", "education_num", "hours_per_week", "income")]

log_reg(datafr = datafr, m = 20)
Kiril D.
  • Hi, you need to make your question [reproducible](https://stackoverflow.com/a/5963610/6574038) for Stack Overflow, cheers. – jay.sf Jul 15 '19 at 08:40
  • Until the line `theta_optim <- ...` it works for me (later: cost not defined). Please post a whole reproducible example which has the problem you define. – January Jul 15 '19 at 08:50

1 Answer


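You are calling cost(), which refers to X and Y, but neither is defined inside cost() or passed to it. R looks up free variables in the environment where a function was defined, not where it was called, so cost() (defined at top level) searches the global environment and never sees the X and Y that are local to log_reg(). That is also why running the lines manually works: X and Y then live in the global environment. Either define cost() inside log_reg() after X and Y have been created, or, better, make X and Y parameters of cost().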

cost <- function(theta, X, Y){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}

And later, inside log_reg():

theta_optim <- optim(par=initial_theta, fn=cost, X=X, Y=Y)

In general, avoid using variables inside a function that are not explicitly passed in as arguments; otherwise you will keep running into problems like this one.
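
The scoping behaviour behind this error can be reproduced in a few lines. The following is only a minimal sketch with made-up names (`f_global`, `caller`, `caller_closure` and `M` are not part of the original code), and it assumes no object called `M` exists in the global environment. It also shows the first option mentioned above: a function defined inside another function can see that function's local variables.

```r
# Hypothetical names, for illustration only.
# f_global is defined at top level, so a free variable like M is looked up
# in the global environment, not in whoever happens to call f_global().
f_global <- function() nrow(M)

caller <- function() {
  M <- matrix(1:6, nrow = 3)      # local to caller(), invisible to f_global()
  f_global()
}

caller_closure <- function() {
  M <- matrix(1:6, nrow = 3)
  f_local <- function() nrow(M)   # defined here, so it can see the local M
  f_local()
}

# The first call fails in the same way as nrow(X) in the question;
# the error message is captured instead of stopping the script.
res_global  <- tryCatch(caller(), error = function(e) conditionMessage(e))
res_closure <- caller_closure()

print(res_global)
print(res_closure)
```

Passing the data as arguments, as in the corrected cost() above, is usually easier to test and reuse than relying on a closure.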

Also, how did I find it out? I used traceback():

> traceback()
5: nrow(X) at #2
4: fn(par, ...)
3: (function (par) 
   fn(par, ...))(c(0, 0, 0, 0))
2: optim(par = initial_theta, fn = cost) at #33
1: log_reg(datafr = datafr, m = 20)
January
  • I also added Y=Y to the optim call and as an input to the cost function. Now it works. Thanks a lot! Do you happen to know why it doesn't work to write `theta_optim <- optim(par=c(initial_theta, X, Y), fn=cost)`? – Kiril D. Jul 15 '19 at 09:08
  • Because `par` is only the starting value of the parameter vector that optim optimizes. Everything you put into it, here also X and Y, would be treated as parameters to be varied. Fixed data is passed as extra named arguments instead. – January Jul 15 '19 at 09:11
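
The split between `par` and the extra arguments can be seen on a toy problem. This is a hedged sketch, not the poster's code: the function `sse` and the data `x`, `y` are invented for illustration. `par` holds only the starting values of the two parameters, while the fixed data reaches the objective function through the `...` argument of optim.

```r
# Hypothetical toy problem: least-squares fit of y = a + b * x.
sse <- function(theta, x, y) {
  # theta[1] is the intercept, theta[2] the slope; x and y are fixed data
  sum((y - (theta[1] + theta[2] * x))^2)
}

x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x                    # exact line, so the true parameters are (2, 3)

# par has one starting value per parameter; x and y are forwarded to sse()
# unchanged on every evaluation via optim's '...' argument.
fit <- optim(par = c(0, 0), fn = sse, x = x, y = y)

print(fit$par)                    # estimates of intercept and slope
print(length(fit$par))            # same length as the starting vector
```

Putting `x` and `y` into `par` would instead ask optim to search over twelve numbers, and `sse` would have no way to tell parameters from data.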