
I've recently had to use RStudio for a project at work, having never used it before. I understand the statistics side, as I come from a financial mathematics background; however, my boss is now asking me a lot of questions about RStudio, what it's doing, how, and so on.

Anyway, I got a logistic regression to work and it confirms the coefficients he found when building a logistic regression in Excel. Now, however, I'm being asked why the coefficients found in a k-fold cross-validation are different from what he found in Excel.
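
For context, the plain (non-cross-validated) regression that reproduced his Excel coefficients was something along these lines (fit on the full data set for illustration; basic_fit is just a placeholder name, and the real variable names are different):

# Plain logistic regression on the whole data set -- this is the fit
# whose coefficients matched the Excel ones
basic_fit <- glm(Depvar ~ indvar1 + indvar2,
                 data = my_data,
                 family = binomial)

summary(basic_fit)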

I followed the method in this video exactly (he uses more independent variables than I do, but it's still a logistic regression): video. The code I ended up with is below (with details removed). It runs, but as mentioned above, it produces positive coefficients while my boss gets negative coefficients (all small, but opposite in sign).

library(readxl)
library(tidyverse)
library(caret)
library(e1071)

### Importing the data file
my_data <- read_excel("blahblahblah.xlsx", sheet = 1)

str(my_data)

## Convert my_data into a data frame (from tibble)

my_data <- as.data.frame(my_data)

## Partitioning the data: Create index matrix of selected values

# Set random seed

set.seed(1000000)

# Create index matrix

index <- createDataPartition(my_data$Depvar, p = .8, list=FALSE, times = 1)

### Create train_df and test_df

train_df <- my_data[index,]
test_df <- my_data[-index,]

# Re-label values of outcome (1 = insolvent, 0 = solvent)

train_df$Depvar[train_df$Depvar==1] <- "Insolvent"
train_df$Depvar[train_df$Depvar==0] <- "Solvent"
test_df$Depvar[test_df$Depvar==1] <- "Insolvent"
test_df$Depvar[test_df$Depvar==0] <- "Solvent"

# Convert outcome variable to type factor

train_df$Depvar <- as.factor(train_df$Depvar)
test_df$Depvar <- as.factor(test_df$Depvar)

### Specify training method and number of folds

ctrlspecs <- trainControl(method="cv", number = 5,
                          savePredictions = "all", 
                          classProbs = TRUE)

# Set random seed
set.seed(1000000)

### Specify logit model

model1 <- train(Depvar ~ indvar1 + indvar2,
                data = train_df,
                method = "glm", family = binomial,
                trControl = ctrlspecs)

print(model1)

Is there anything I'm doing wrong here that would give erroneous results? Or is the method in Excel different enough that I shouldn't expect the same coefficients?
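
In case it's relevant, the coefficients I'm comparing come from the final model inside the caret object, i.e. something equivalent to:

# Coefficients of the model caret keeps after cross-validation
coef(model1$finalModel)

# Full glm-style summary of that same model
summary(model1$finalModel)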

EDIT: the use of k-fold cross-validation, the choice of seed, the number of folds, etc. were all at my boss's insistence.

EDIT2: Apologies for not including an MRE. I'm not entirely sure how to do that, but the dependent variable is about 96,000 1s and 0s, 1 being insolvent and 0 being solvent. The independent variables are asset values and profit values, which range from 0 up into the billions.
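
The best I can do by way of a reproducible example is simulated data of the same shape (the distributions below are made up, not the real figures):

# Simulated stand-in for the real (confidential) data:
# ~96,000 rows, asset/profit values from 0 up into the billions,
# and a 0/1 insolvency flag
set.seed(123)
n <- 96000
fake_data <- data.frame(
  indvar1 = runif(n, 0, 5e9),   # stand-in for assets
  indvar2 = runif(n, 0, 1e9),   # stand-in for profits
  Depvar  = rbinom(n, 1, 0.05)  # 1 = insolvent, 0 = solvent
)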

StMatthias
  • Can't speak to the specifics of your data (no MRE included), but I would suggest plotting your data to check whether your positive coefficients are correct (do you observe a positive or negative trend overall? You could fit a geom_smooth curve; see the sketch after these comments). It could well be that Excel is incorrect or using unusual assumptions (e.g. default parameters). – Bowhaven Feb 20 '23 at 15:05
  • I don't think you can synchronize the random seed across R and Excel. If you want to compare the result, export the split data, or at least the indices. Random splits can definitely lead to different results. – shs Feb 20 '23 at 15:31
  • Also, some meta-advice: If you don't want to sound like a complete beginner, say you are doing your analysis with R. Rstudio is just the IDE. Also, if you want more helpful responses, you should provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – shs Feb 20 '23 at 15:33
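
A minimal sketch of what Bowhaven and shs suggest above, using the objects and (assumed) column names from the question:

library(ggplot2)

# Bowhaven's suggestion: plot the raw data with a fitted logistic curve
# to see whether the overall trend is positive or negative
ggplot(train_df, aes(x = indvar1,
                     y = as.numeric(Depvar == "Insolvent"))) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "glm", method.args = list(family = binomial))

# shs's suggestion: export the training indices so exactly the same
# 80/20 split can be reproduced outside R (e.g. in Excel)
write.csv(index, "train_index.csv", row.names = FALSE)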

0 Answers