Are outliers affecting the shape of my logistic regression curve or is it the fact the samples are unbalanced?

Question

I am trying to fit a logistic regression to a dataset. I have converted Age (a categorical variable) to binary (0 = "Adult", 1 = "Immature"). Tail length is a continuous numerical variable, and I want to predict the probability that an animal with a tail length greater than 220 mm is immature.

There is a large difference in sample size between the two age classes, as shown here:

table(rt$Age)

#  Adult  Immature
#    120       448

Some code:

library(dplyr) # provides recode()

# Create a separate binary column from Age (0 = adult, 1 = immature)
rt$bin_age <- recode(rt$Age, "A" = 0, "I" = 1)

library(ggplot2)
ggplot(rt, aes(x = Tail,
                 y = bin_age)) +
  geom_jitter(color = "blue", 
              size = 3, 
              height = 0.04,
              width = 0.2,
              alpha = 0.5) +
  geom_smooth(method = "loess", size = 1, 
              col = "red", lty = 2, se = FALSE) +
  labs(x = "Tail Length (in mm)", y = "Sex") +
  theme_classic()

When I plot the data using ggplot, I get the following image:

[Image: plot of tail length (x) vs sex (y)]

Rather than producing a clean sigmoidal curve, it produces more of a "sideways S" curve.

I identified three outliers with tail lengths below 175 mm, so I removed them:

# Use 175mm as the cut-off = remove values <175mm

which(rt$Tail < 175) #  Rows 261, 317 and 361

# Remove rows 261, 317 and 361
rt <- rt[-c(261, 317, 361) ,]

and got this image:

[Image: plot of tail length (x) vs sex (y), outliers removed]

Has this occurred because of the difference in sample size between the two age classes? Is there a way to equalise the sample sizes (e.g. through looped sub-sampling or something similar) so I can interpret this more appropriately?
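For reference, the looped sub-sampling idea could be sketched as below. This is only a minimal illustration on simulated data: the data frame `sim`, its group means and standard deviations, and the number of repetitions `n_rep` are all made up and are not taken from `rt`. In theory, randomly dropping cases from one class mainly shifts the intercept of a logistic regression and leaves the slope roughly unchanged, so the comparison at the end is a quick way to check whether imbalance is really what drives the shape of the curve.

```r
# Simulated stand-in for the real data (hypothetical values, not the rt dataset)
set.seed(1)
n_adult <- 120
n_imm <- 448
sim <- data.frame(
  bin_age = c(rep(0, n_adult), rep(1, n_imm)),
  Tail = c(rnorm(n_adult, mean = 205, sd = 12),
           rnorm(n_imm, mean = 215, sd = 12))
)

# Fit on the full, unbalanced data
full_fit <- glm(bin_age ~ Tail, family = binomial, data = sim)

# Repeatedly down-sample the larger group to the size of the smaller one
n_rep <- 200
adult_rows <- which(sim$bin_age == 0)
imm_rows <- which(sim$bin_age == 1)
sub_coefs <- t(replicate(n_rep, {
  keep <- c(adult_rows, sample(imm_rows, length(adult_rows)))
  coef(glm(bin_age ~ Tail, family = binomial, data = sim[keep, ]))
}))

# Compare the full-data coefficients with the average over balanced sub-samples
rbind(full = coef(full_fit), balanced_mean = colMeans(sub_coefs))
```

If the `Tail` slope is similar in both rows, balancing the groups is unlikely to change the shape of the fitted curve much; it would mostly move the curve up or down.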

I also made a visreg() plot with the outliers left in, and I'm not sure whether it is the more appropriate one to use.

library(visreg)

# Logistic regression of age class on tail length
age_glm <- glm(bin_age ~ Tail,
               family = binomial(link = "logit"),
               data = rt)

summary(age_glm)

# Fitted curve on the probability scale, with the jittered 0/1 data on top
visreg(age_glm, xvar = "Tail", scale = "response", rug = FALSE)
points(jitter(bin_age, 0.2) ~ Tail,
       data = rt,
       pch = 20, col = "black", cex = 1, lwd = 1)

Giving me this graph:

It still looks a bit funky...

Note that I used a Kruskal-Wallis test to test for a difference in tail length between the two ages, and P < 0.001, so I was expecting a more marked difference in the graphs.
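A very small P value says the two tail-length distributions differ, not how sharply the predicted probability changes with tail length. The sketch below illustrates this on simulated data. The data frame `sim`, its group means and spreads, and the tail lengths 200, 220 and 240 mm are assumptions chosen for illustration and are not taken from `rt`.

```r
# Simulated stand-in for the real data (hypothetical values, not the rt dataset)
set.seed(3)
sim <- data.frame(
  bin_age = c(rep(0, 120), rep(1, 448)),
  Tail = c(rnorm(120, mean = 205, sd = 12),
           rnorm(448, mean = 215, sd = 12))
)

# Rank-based test for a difference in tail length between the two groups
kw_p <- kruskal.test(Tail ~ bin_age, data = sim)$p.value

# Logistic regression of age class on tail length
fit <- glm(bin_age ~ Tail, family = binomial, data = sim)

# Predicted probability of being immature at a few tail lengths (mm)
new_tail <- data.frame(Tail = c(200, 220, 240))
p_hat <- predict(fit, newdata = new_tail, type = "response")

# Odds ratio for a 1 mm increase in tail length
odds_ratio <- exp(coef(fit)["Tail"])

kw_p
cbind(new_tail, p_immature = p_hat)
odds_ratio
```

When one group is much larger than the other, the predicted probabilities over the observed range can all sit in the upper part of the sigmoid, so the plotted curve may show only part of the "S" even though the test is highly significant.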

I assume you're using Age and tail length as synonyms here? It's a little confusing. — Ben Bolker, Feb 02 '21 at 17:07
Tail is a continuous numerical variable. I will edit to try and make it clearer. Thanks. — emmz, Feb 02 '21 at 17:14
Or maybe I'm confused because you refer to "Age" (which I now see is binary) but your plots have "Sex" as the y-axis label...??? — Ben Bolker, Feb 02 '21 at 17:15
Yes, that is my fault! Stupid typo - I was recycling the graph from a different plot that works. Ignore the "Sex" on the y-axis, it should say "Age". — emmz, Feb 02 '21 at 17:22
The two types of graphs are doing different things. The first one is using a loess smoother, which doesn't make any functional form assumptions about the shape of the curve. The result is not necessarily a function of imbalance. The `visreg()` function is plotting a parametric model that imposes the sigmoid shape on the relationship. I don't see anything in the plot that is inconsistent with finding a significant difference between the two groups. — DaveArmstrong, Feb 02 '21 at 18:06
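The contrast described in the last comment can be sketched in base R as follows. This is a minimal illustration on simulated data: `sim`, its group means and spreads, and the helper data frame `new_tail` are hypothetical and not taken from `rt`. It fits a loess smoother and a logistic regression to the same 0/1 response and overlays both curves.

```r
# Simulated stand-in for the real data (hypothetical values, not the rt dataset)
set.seed(2)
sim <- data.frame(
  bin_age = c(rep(0, 120), rep(1, 448)),
  Tail = c(rnorm(120, mean = 205, sd = 12),
           rnorm(448, mean = 215, sd = 12))
)

# Non-parametric smoother: follows local averages, no assumed shape
lo_fit <- loess(bin_age ~ Tail, data = sim)

# Parametric model: imposes a monotone sigmoid on the probability scale
glm_fit <- glm(bin_age ~ Tail, family = binomial, data = sim)

# Evaluate both on a grid spanning the observed tail lengths
new_tail <- data.frame(Tail = seq(min(sim$Tail), max(sim$Tail),
                                  length.out = 100))
new_tail$p_loess <- predict(lo_fit, newdata = new_tail)
new_tail$p_glm <- predict(glm_fit, newdata = new_tail, type = "response")

# Overlay the two curves on the jittered 0/1 data
plot(jitter(bin_age, 0.2) ~ Tail, data = sim, pch = 20, col = "grey60",
     xlab = "Tail length (mm)", ylab = "Age (0 = adult, 1 = immature)")
lines(p_loess ~ Tail, data = new_tail, col = "red", lty = 2)
lines(p_glm ~ Tail, data = new_tail, col = "blue")
```

In ggplot2, the same logistic curve can be drawn by swapping the smoother for `geom_smooth(method = "glm", method.args = list(family = binomial))`, which makes the ggplot figure directly comparable to the `visreg()` one.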
