I've recently had to use RStudio for a project at work, having never used it before. I understand the statistics side, as I come from a financial mathematics background; however, my boss is now asking me a lot of questions about what RStudio is doing and how.
Anyway, I got a logistic regression working, and it confirms the coefficients he found when building a logistic regression in Excel. Now, however, I'm being asked why the coefficients from a k-fold cross-validation are different from what he found in Excel.
I followed the method in this video exactly (he uses more independent variables than I do, but it's still a logistic regression): video. The code I ended up with looks like this (details removed). It runs, but as mentioned above, it produces positive coefficients while my boss gets negative ones (all small, but with opposite signs).
library(readxl)
library(tidyverse)
library(caret)
library(e1071)
### Importing the data file
my_data <- read_excel("blahblahblah.xlsx", sheet = 1)
str(my_data)
## Convert my_data into a data frame (from tibble)
my_data <- as.data.frame(my_data)
## Partitioning the data: Create index matrix of selected values
# Set random seed
set.seed(1000000)
# Create index matrix
index <- createDataPartition(my_data$Depvar, p = 0.8, list = FALSE, times = 1)
### Create train_df and test_df
train_df <- my_data[index,]
test_df <- my_data[-index,]
# Re-label values of outcome (1 = insolvent, 0 = solvent)
train_df$Depvar[train_df$Depvar==1] <- "Insolvent"
train_df$Depvar[train_df$Depvar==0] <- "Solvent"
test_df$Depvar[test_df$Depvar==1] <- "Insolvent"
test_df$Depvar[test_df$Depvar==0] <- "Solvent"
# Convert outcome variable to type factor
train_df$Depvar <- as.factor(train_df$Depvar)
test_df$Depvar <- as.factor(test_df$Depvar)
### Specify training method and number of folds
ctrlspecs <- trainControl(method = "cv", number = 5,
                          savePredictions = "all",
                          classProbs = TRUE)
# Set random seed
set.seed(1000000)
### Specify logit model
model1 <- train(Depvar ~ indvar1 + indvar2,
                data = train_df,
                method = "glm", family = binomial,
                trControl = ctrlspecs)
print(model1)
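For what it's worth, this is roughly how I've been putting the two sets of coefficients side by side (just a sketch using the same placeholder variable names; the plain glm() on the original 0/1 data is my attempt to mirror what I believe the Excel fit is doing):
# Coefficients from the cross-validated model (caret refits the chosen model on the full training set)
coef(model1$finalModel)
# Plain logistic regression on the original 0/1 outcome, as a comparison point
plain_fit <- glm(Depvar ~ indvar1 + indvar2, data = my_data, family = binomial)
coef(plain_fit)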
Is there anything I'm doing wrong here that would give erroneous results? Or does Excel use a different method, such that I shouldn't expect the same results?
EDIT: The k-fold cross-validation, the seed, the number of folds, etc. were all at my boss's insistence.
EDIT 2: Apologies for not including a minimal reproducible example; I'm not entirely sure how to do that. The dependent variable is about 96,000 1s and 0s (1 = insolvent, 0 = solvent), and the independent variables are asset and profit values ranging from 0 up to the billions.
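In the absence of the real file, a rough synthetic stand-in that mirrors that description might look like this (entirely made-up numbers and relationships, not the actual data):
set.seed(1)
n <- 96000
indvar1 <- runif(n, 0, 5e9)   # stand-in for asset values (made up)
indvar2 <- runif(n, 0, 2e9)   # stand-in for profit values (made up)
# Made-up relationship, just so the logistic regression has something to find
p <- plogis(-1 - 1e-9 * indvar1 + 5e-10 * indvar2)
fake_data <- data.frame(indvar1, indvar2,
                        Depvar = rbinom(n, 1, p))
str(fake_data)
Obviously the real scales and relationship are different, but the code above runs against this data frame if my_data is replaced with fake_data.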