
I have a dataset with a binary target (good clients vs. bad clients). For each client, I have a row with many variables (~150).

I wish to do the following:

  1. Build a prediction of bad clients
  2. Calculate a score of how bad a client is.

I wanted to use random forests for prediction, and logistic regression for the score (the probability of being bad, which gives a score between 0 and 1).

I have these problems:

  1. Random forests don't support missing values. I don't know, technically, how to tell R to impute or omit the missing values (I get an error message when using the randomForest package).
  2. In logistic regression, how do I obtain the score for each subject (the probability of being a bad client)?
  3. In general, if I want to fit a model in R, as in the randomForest package, and I need a syntax like Y~X1+X2+..., how can I tell R to include all variables X1 to X150 in the model?

My data looks like this: a variable 'Client', which is 0 or 1, and independent variables X1-X150, some of which are factors and some numeric.

user3275222
    Please include a reproducible example (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It doesn't have to be your entire dataset, just part of it. – Rilcon42 Jun 12 '16 at 11:07

1 Answer

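  1. The randomForest function takes an na.action argument. Note that na.omit does not impute: it drops every row that contains a missing value. To impute instead, pass na.action=na.roughfix, which fills numeric columns with the median and factor columns with the most frequent level.

model1 = randomForest(Species ~ . , data=iris, na.action=na.omit)

  2. The score is the predicted probability from the fitted model. For a binary target, fit the glm with family=binomial (not gaussian) and call predict with type="response"; without it, predict returns log-odds rather than probabilities.

  3. X1 to X150 can be represented by . in the formula, which stands for every column of the data other than the response:

glm.client = glm(Client ~ . , family=binomial, data=training_data)
score.client = predict(glm.client, testing_data, type="response")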
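
The three points can be tried end to end with a minimal sketch on simulated data, since the original data set is not shown. The data frame clients, the two predictors, the train/test split and the 0.5 cutoff are all assumptions made for illustration; the random forest part runs only if the randomForest package is installed.

```r
# Simulated stand-in for the client data (an assumption, not the real data):
# a 0/1 target named Client, one numeric and one factor predictor.
set.seed(1)
n <- 400
clients <- data.frame(
  X1 = rnorm(n),
  X2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
clients$Client <- rbinom(n, 1, plogis(-0.5 + 1.2 * clients$X1))
clients$X1[sample(n, 20)] <- NA  # inject some missing values

# Hypothetical train/test split.
train_idx <- sample(n, 300)
training_data <- clients[train_idx, ]
testing_data  <- clients[-train_idx, ]

# Logistic regression: "." expands to every column except the response.
# With R's default options, glm drops incomplete rows (na.omit).
glm.client <- glm(Client ~ ., family = binomial, data = training_data)

# type = "response" gives P(Client = 1); rows with missing predictors get NA.
score.client <- predict(glm.client, testing_data, type = "response")
pred.client  <- as.integer(score.client > 0.5)  # 0.5 is an arbitrary cutoff

# Random forest, only if the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  rf_train <- training_data
  rf_train$Client <- factor(rf_train$Client)  # factor response => classification
  # na.roughfix imputes (median / most frequent level) instead of dropping rows.
  rf.client <- randomForest(Client ~ ., data = rf_train, na.action = na.roughfix)
  # Fill the test predictors the same rough way before predicting.
  rf_test  <- na.roughfix(testing_data[, c("X1", "X2")])
  rf.score <- predict(rf.client, rf_test, type = "prob")[, "1"]
}
```

Both score.client and rf.score are probabilities of Client being 1, so either can serve as the "how bad" score; the random forest alone would cover both the prediction and the score if a single model is preferred.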
Alexc