I am trying to run a regression using the `glm` function, but I keep getting the same error message: "variable lengths differ (found for 'data')". I can't see how my variables could differ in length, since I use the same sample of 1000 rows for both my dependent and independent variables. I take a sample because I have more than a million observations and want to check that the model works properly first (running it on all the data takes a very long time). This is the code I use:

sample = sample(1:nrow(agg), 1000, replace = FALSE)
y=agg$TO_DEFAULT_IN_12M_INDICATOR[sample]

test <- glm(as.factor(y) ~., data = as.factor(agg[sample,]), family = binomial)
#coef(full.model)

Here `agg` contains all my data, and `y` is an indicator variable of 0s and 1s. Does anyone know how I could fix this problem?

  • It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. Your `data=` parameter should be a data.frame but you seem to be passing in a factor? Not sure what the `as.factor` is supposed to be doing there. You don't need to subset your variables in your formula, just subset the data in `data=`. Use the variable names from the data.frame columns in the formula to keep everything in sync. – MrFlick May 12 '22 at 14:45
  • I am new to this forum so I find it hard to explain my problem properly, sorry for that. But if I leave out as.factor I get the following error: "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels". – Ramsy.dhifallah May 12 '22 at 14:52
  • Well, that means there's an issue with your data. But since we have no idea what your data looks like, it's difficult to provide any sort of specific help. – MrFlick May 12 '22 at 14:53
  • My data consists of 135 variables, both numeric and character. They include things like "type of mortgage" and whether a client has ever been in arrears. – Ramsy.dhifallah May 12 '22 at 14:58
  • In `as.factor(agg[sample,])` you attempt to coerce the entire data frame to a factor, which won't work. Try replacing it with `as.data.frame(lapply(agg[sample, ], as.factor))`. – jay.sf May 12 '22 at 15:59
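
Putting the comments above together, here is a minimal sketch of one way this could be set up. It is not tested against the real data: the `agg` built here is a made-up stand-in, and every column name other than `TO_DEFAULT_IN_12M_INDICATOR` (`balance`, `mortgage_type`, `country`) is hypothetical. The idea is to subset the rows once, convert only the character columns to factors, drop columns that have a single distinct value in the sample (the likely cause of the `contrasts` error), and refer to the response by its column name in the formula.

```r
# Hypothetical stand-in for `agg`; the real data has about 135 columns.
set.seed(42)
n <- 5000
agg <- data.frame(
  TO_DEFAULT_IN_12M_INDICATOR = rbinom(n, 1, 0.3),
  balance       = rnorm(n, mean = 100, sd = 20),                     # numeric predictor
  mortgage_type = sample(c("fixed", "variable"), n, replace = TRUE), # character predictor
  country       = "NL",                                              # constant column
  stringsAsFactors = FALSE
)

# Sample row indices, then subset the data frame once.
idx <- sample(nrow(agg), 1000, replace = FALSE)
sub <- agg[idx, ]

# Convert only the character columns to factors; numeric columns stay numeric.
is_chr <- vapply(sub, is.character, logical(1))
sub[is_chr] <- lapply(sub[is_chr], as.factor)

# Drop columns with fewer than two distinct non-missing values in the sample.
# A one-level factor is what triggers the "contrasts can be applied only to
# factors with 2 or more levels" error.
n_distinct <- vapply(sub, function(col) length(unique(col[!is.na(col)])), integer(1))
sub <- sub[, n_distinct >= 2, drop = FALSE]

# The response is named in the formula, so it stays in sync with the predictors.
fit <- glm(TO_DEFAULT_IN_12M_INDICATOR ~ ., data = sub, family = binomial)
print(coef(fit))
```

Because `y` is no longer a separate vector, there is nothing left whose length can differ from the rows in `data`. If the real data has many missing values, a factor could still collapse to one level once `glm` drops incomplete rows, so it may be worth running the same check on the complete cases.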

0 Answers