Multiple sampling inside an R function

Question

I am trying to make a function that in the end will run multiple machine learning algorithms on my data set. I have the first little bit of my function below and a small sample of data.

The problem I am running into is with sampling my data into four different data frames and then applying them to the given functions. In the first function I am testing, the data runs through the logistic regression model, but the output shows that the model uses all the data and not just 1/4 of the data frame df, as I intend. I checked with <<- to see what data is being passed through, and it sends a data set that is 1/4 of the data frame df, which is what I am looking for. Question: why does the right subset reach my global environment but not my regression function, and how would I correct this?

Data:

zeroFac <- c(1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1)

goal <- c(8.412055,  7.528869,  8.699681, 10.478752,  9.210440, 10.308986, 10.126671, 11.002117, 10.308986,  7.090910, 10.819798,  7.824446,  8.612685,
7.601402, 10.126671,  7.313887,  5.993961,  7.313887,  8.517393, 12.611541)

City_Pop <- c( 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613,
11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613)

df <- data.frame(zeroFac,goal,City_Pop)

Function:

forestModel <- function(eq1, ...){

  #making our original data frame
  train <- data.frame(cbind(...))

  ################

    #splitting into 4 data sets
    set.seed(123)

    ss <- sample(1:4, size = nrow(train), replace=TRUE, prob = c(0.25,0.25,0.25,0.25))

    t1 <- train[ss==1,]
    t2 <- train[ss==2,]
    t3 <- train[ss==3,]
    t4 <- train[ss==4,]

  ################

  m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
  summary(m)

}

eq1 <- df$zeroFac ~ df$goal + df$City_Pop


forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)

In the output of the logistic regression it tells me that it is using all observations not just a quarter of them. — Clinton Woods, May 28 '18 at 05:20

kangaroo_cliff · Answer 1 · 2018-05-28T23:38:16.897

In train, the column names are not what you expect them to be ("zeroFac", "goal" and "City_Pop"); they are "X1", "X2" and "X3".
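
A minimal sketch to check this, assuming a small toy data frame in place of the question's df (the values and the helper name build_train are made up for illustration):

```r
# Toy data standing in for the question's df (illustrative values only)
df <- data.frame(zeroFac  = c(1, 0, 1, 0),
                 goal     = c(8.4, 7.5, 8.7, 10.5),
                 City_Pop = rep(11.6, 4))

# Same construction as in the question, wrapped in a hypothetical helper
build_train <- function(...) data.frame(cbind(...))

train <- build_train(df$zeroFac, df$goal, df$City_Pop)

# cbind() produces an unnamed matrix here, so data.frame() falls back
# to default names instead of "zeroFac", "goal", "City_Pop"
print(colnames(train))
```

Because none of these default names appear in the formula, glm cannot find the formula's variables in the data argument.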

According to the glm help, when the variables in the formula are not available in the data, they are taken from the environment(formula). Hence, it is using the data in the global environment - where the formula is created.

From ?glm:

data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which glm is called.

Your formula is also incorrect. It should be of the form eq1 <- zeroFac ~ goal + City_Pop. But, correcting that alone would NOT fix your problem.
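
The lookup rule can be seen in a small self-contained sketch. Everything here is an assumption made for illustration: the vectors y and x are simulated, and the names f, sub_wrong and sub_right do not come from the question.

```r
set.seed(1)

# Global vectors, playing the role of zeroFac and goal (simulated values)
y <- rbinom(40, 1, 0.5)
x <- rnorm(40)

f <- y ~ x   # formula created in the same environment as y and x

# A 10-row subset whose column names do NOT match the formula
sub_wrong <- data.frame(X1 = y[1:10], X2 = x[1:10])
# The same subset with matching column names
sub_right <- data.frame(y = y[1:10], x = x[1:10])

m_wrong <- glm(f, family = binomial(link = "logit"), data = sub_wrong)
m_right <- glm(f, family = binomial(link = "logit"), data = sub_right)

# m_wrong silently falls back to the 40 global observations;
# m_right uses only the 10 rows that were passed in
print(nobs(m_wrong))
print(nobs(m_right))
```

This is why a correct formula alone is not enough: as long as the subset's columns are named X1, X2, ..., glm resolves the formula's names in environment(formula) and fits on the full vectors.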

EDIT

One option would be to pass the names of the variables separately, as in

forestModel <- function(eq1, colnam, ...) {

  train <- data.frame(cbind(...))
  colnames(train) <- colnam

  # splitting the data
  set.seed(123)

  ss <- sample(1:4, size = nrow(train), replace=TRUE, 
                    prob = c(0.25,0.25,0.25,0.25))

  t1 <- train[ss==1,]

  m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
  summary(m)
}

eq1 <- zeroFac ~ goal + City_Pop
colnam <- c("zeroFac", "goal", "City_Pop")

forestModel(eq1, colnam, df$zeroFac, df$goal, df$City_Pop)

# Call:
#   glm(formula = eq1, family = binomial(link = "logit"), data = t1)
# 
# Deviance Residuals: 
#   2           4           5           8          11          16  
# 9.915e-06   2.110e-08  -1.080e-05  -2.110e-08   2.110e-08   2.110e-08  
# 20  
# 6.739e-06  
# 
# Coefficients:
#   Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -960.87 2187192.38       0        1
# goal             12.32   41237.80       0        1
# City_Pop         74.28  166990.04       0        1
# 
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 8.3758e+00  on 6  degrees of freedom
# Residual deviance: 2.6043e-10  on 4  degrees of freedom
# AIC: 6
# Number of Fisher Scoring iterations: 25

SeGa · Accepted Answer · 2018-05-28T06:23:24.030


You have to change the formula and name the columns of the train data set inside the function. The equation changes from eq1 <- df$zeroFac ~ df$goal + df$City_Pop to eq1 <- zeroFac ~ goal + City_Pop. Otherwise the formula refers to the data frame df itself and not just to the column names. And after binding the train data together, you have to name its columns, so that glm knows which columns you are referring to in the equation.

forestModel <- function(eq1, ...){

  #making our original data frame

  train <- data.frame(cbind(...))
  # data.frame(...) names the columns "df.zeroFac", "df.goal", ...;
  # keep the part after the dot as the column name
  colNames <- colnames(data.frame(...))
  coln <- sapply(strsplit(colNames, "\\."), function(X) X[[2]])
  colnames(train) <- coln

  ################

  #splitting into 4 data sets
  set.seed(123)

  ss <- sample(1:4, size = nrow(train), replace=TRUE, prob = c(0.25,0.25,0.25,0.25))

  t1 <- train[ss==1,]
  ################

  m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
  summary(m)
}

eq1 <- zeroFac ~ goal + City_Pop
forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)
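
If the caller can pass the data frame itself rather than individual columns, the renaming step is not needed, because the columns keep their own names. The following is only a sketch under that assumption: forestModelDF, its seed argument and the simulated data frame sim are hypothetical names, not part of the question or of either answer.

```r
# Hypothetical variant: take the data frame itself instead of loose columns
forestModelDF <- function(eq1, data, seed = 123) {
  set.seed(seed)

  # assign every row to one of four groups with equal probability
  ss <- sample(1:4, size = nrow(data), replace = TRUE)
  parts <- split(data, factor(ss, levels = 1:4))

  # fit on the first group only, as in the function above
  m <- glm(eq1, family = binomial(link = "logit"), data = parts[[1]])
  list(model = m, parts = parts)
}

# Simulated data for illustration (not the question's data)
set.seed(42)
sim <- data.frame(zeroFac  = rbinom(80, 1, 0.5),
                  goal     = rnorm(80),
                  City_Pop = rnorm(80))

res <- forestModelDF(zeroFac ~ goal + City_Pop, sim)

# The model should report as many observations as the first group has rows
print(nobs(res$model))
print(nrow(res$parts[[1]]))
```

With this design, adding a variable to the formula only requires that the column exists in the data frame; the function itself does not change.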

edited May 28 '18 at 06:23

answered May 28 '18 at 05:43


You are correct about the mistake. But, it still would not give the correct answer. See my answer below. – kangaroo_cliff May 28 '18 at 05:45
I think it does give the correct result. Checking the deviance.resid of the summary gives me fewer than 20 elements (fewer than in the original data.frame) – SeGa May 28 '18 at 05:53
I didn't notice `colnames(train) <- c("zeroFac", "goal", "City_Pop")` when I made the comment. I agree - it should work. I think the following statement is not true **Otherwise it also contains the call to the dataframe and not just to the column names.** Even if you had used the formula correctly, as in `eq1 <- zeroFac ~ goal + City_Pop`, it will still use the data in the global environment. – kangaroo_cliff May 28 '18 at 06:02
A problem arises, though. As I am using ..., how would you change the column names if one were to add new variables to the equation without changing the function? – Clinton Woods May 28 '18 at 06:02
@Suren. You can try it out with both formulas, once with `df$...` and once with only the column names. The outputs will be different. @Clinton. I am trying to figure that out, although I do not work much with the ellipsis argument. – SeGa May 28 '18 at 06:10
@ClintonWoods I edited my answer. It finds out the column names without the need to give them explicitly as function argument. – SeGa May 28 '18 at 06:24
