This is the result that I get when running the confusion matrix:

      TRUE
  0     47
  1 231194

I want something like this:

       0    1
  0 1000   47
  1   50 3000

I am not sure what I am doing wrong or why the confusion matrix comes out like this. I tried reclassifying the variables to see if that would help. I am also wondering whether the variables I have chosen are inappropriate for the problem and are causing the issue. Please help.

Traffic_Reduced <- Traffic_Reduced[,c("Date.Of.Stop","Fatal","Time.Of.Stop","SubAgency","Belts","Commercial.License","HAZMAT",
                                      "Commercial.Vehicle","Alcohol","Work.Zone","State","VehicleType",                 
                                      "Make", "Color","Violation.Type", "Race","Gender","Driver.City",
                                      "Driver.State","DL.State")]

Traffic_Reduced$Time.Of.Stop <- as.numeric(Traffic_Reduced$Time.Of.Stop)

Traffic_Reduced$Time.Of.Stop = ifelse(is.na(Traffic_Reduced$Time.Of.Stop),
                                      ave(Traffic_Reduced$Time.Of.Stop, FUN = function(x) mean(x, na.rm = TRUE)),
                                      Traffic_Reduced$Time.Of.Stop)

# Classification - Logistic Regression
# Fatal (Y/N) and Time.of.Stop

classification <- Traffic_Reduced[,c("Date.Of.Stop","Time.Of.Stop","Fatal")]

classification$Time.Of.Stop <- as.numeric(classification$Time.Of.Stop)
classification$Date.Of.Stop <- as.numeric(classification$Date.Of.Stop)

classification$Fatal = factor(classification$Fatal,
                              labels = c(0,1)
                              )
library(caTools)  # provides sample.split()
set.seed(100)
split = sample.split(classification$Fatal, SplitRatio = 0.7)
training_set = subset(classification, split == TRUE)
test_set = subset(classification, split == FALSE)

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

classifier = glm(formula = Fatal ~ .,
                 family = binomial(link="logit"),
                 data = training_set)

# Predicted probabilities for the test set
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
# Convert probabilities to 0/1 class labels
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Confusion matrix
cm = table(test_set[, 3], y_pred > 0.5)

Sample of Traffic_Reduced dataset:

  Date.Of.Stop Fatal Time.Of.Stop                                       SubAgency Belts Commercial.License
1   09/24/2013    No     17:11:00                     3rd district, Silver Spring    No                 No
2   08/29/2017    No     10:19:00                          2nd district, Bethesda    No                 No
3   12/01/2014    No     12:52:00 6th district, Gaithersburg / Montgomery Village    No                 No
4   08/29/2017    No     09:22:00                     3rd district, Silver Spring    No                 No
5   08/28/2017    No     23:41:00 6th district, Gaithersburg / Montgomery Village    No                 No
6   08/27/2013    No     00:55:00                          2nd district, Bethesda    No                 No
  HAZMAT Commercial.Vehicle Alcohol Work.Zone State     VehicleType        Make  Color Violation.Type
1     No                 No      No        No    MD 02 - Automobile        FORD  BLACK       Citation
2     No                 No      No        No    VA 02 - Automobile      TOYOTA  GREEN       Citation
3     No                 No      No        No    MD 02 - Automobile       HONDA SILVER       Citation
4     No                 No      No        No    MD 02 - Automobile        DODG  WHITE       Citation
5     No                 No      No        No    MD 02 - Automobile MINI COOPER  WHITE       Citation
6     No                 No      No        No    MD 02 - Automobile     HYUNDAI   GRAY       Citation
   Race Gender     Driver.City Driver.State DL.State
1 BLACK      M     TAKOMA PARK           MD       MD
2 WHITE      F FAIRFAX STATION           VA       VA
3 BLACK      F  UPPER MARLBORO           MD       MD
4 BLACK      M FORT WASHINGTON           MD       MD
5 WHITE      M    GAITHERSBURG           MD       MD
6 WHITE      F   SILVER SPRING           MD       MD

Sample of classification dataset - I converted the timestamp / date to numeric values:

  Date.Of.Stop Time.Of.Stop Fatal
1         1582         1032     0
2         1447          620     0
3         1923          773     0
4         1447          563     0
5         1441         1422     0
6         1431           56     0
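
For comparison, here is a minimal sketch of an explicit date/time conversion. It assumes the two columns are stored as text in the `MM/DD/YYYY` and `HH:MM:SS` formats shown in the sample; the `stops`, `date_num` and `time_min` names are made up for illustration. Calling `as.numeric()` directly on a factor column returns the internal level codes rather than a parsed date or time, so an explicit conversion may be easier to interpret.

```r
# Hypothetical sample values copied from the first two rows shown above
stops <- data.frame(
  Date.Of.Stop = c("09/24/2013", "08/29/2017"),
  Time.Of.Stop = c("17:11:00", "10:19:00"),
  stringsAsFactors = FALSE
)

# Parse the date and express it as days since 1970-01-01
stops$date_num <- as.numeric(as.Date(stops$Date.Of.Stop, format = "%m/%d/%Y"))

# Split "HH:MM:SS" and convert to minutes since midnight
parts <- strsplit(stops$Time.Of.Stop, ":", fixed = TRUE)
stops$time_min <- sapply(parts, function(p) {
  as.numeric(p[1]) * 60 + as.numeric(p[2]) + as.numeric(p[3]) / 60
})

print(stops)
```

If the columns are factors, wrapping them in `as.character()` before parsing should give the same result.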
draj01100
  • When asking for help, be sure to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data. The error likely has to do with using "$" in an `aes()`, which is a big no-no, but it is hard to say for sure without being able to run and test the code. – MrFlick Sep 08 '17 at 15:11
  • Is the first table the result of `cm = table(test_set[, 3], y_pred > 0.5)`? If so, you have a huge imbalance in case / non-case, so it is unsurprising that a 0.5 threshold doesn't pick up any non-cases. You will likely need to increase the threshold for your predictions (i.e. `y_pred > 0.99`) to capture any non-case predictions. – user20650 Sep 08 '17 at 18:33
  • PS: does your training_set data have the same numbers of case / non-case? Are you using all the predictors that you show? If so, you are unlikely to have enough info to support that. A rule of thumb, IIRC (see Harrell's RMS), is around 10-20 events per variable (the event in this case is the non-case, as it's the minimum of case/non-case). Also see Firth's logistic regression (`logistf` package), which can produce better (less biased) results for rare events. – user20650 Sep 08 '17 at 18:56
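
The threshold point in the comments can be illustrated with a small sketch. It uses synthetic data (not the traffic data) in which about 1% of outcomes are 0, and all variable names are made up for illustration. Note also that `y_pred` in the question is already 0/1, so `y_pred > 0.5` inside `table()` turns it into TRUE/FALSE, which is why the single column is labelled TRUE. Giving the predictions explicit factor levels keeps both columns in the table even when one class is never predicted.

```r
set.seed(100)

# Synthetic, heavily imbalanced data (illustrative only): about 99% of outcomes are 1
n <- 10000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(5 + x))
dat <- data.frame(x = x, y = factor(y, levels = c(0, 1)))

fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
prob <- predict(fit, type = "response")

# Class balance: worth checking before choosing a cutoff
print(table(dat$y))

# Explicit factor levels keep both columns even if one class is never predicted
pred_05 <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
cm_05 <- table(actual = dat$y, predicted = pred_05)
print(cm_05)

# A cutoff near the observed rate of 1s puts predictions in both classes
cutoff <- mean(dat$y == 1)
pred_rate <- factor(ifelse(prob > cutoff, 1, 0), levels = c(0, 1))
cm_rate <- table(actual = dat$y, predicted = pred_rate)
print(cm_rate)
```

Using the observed rate as the cutoff is only one possible choice; the right threshold depends on the relative cost of the two kinds of error.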
