Background: Point-biserial correlation is used to measure the relationship between a binary variable, x, and a continuous variable, y.

Methods: I use the cor.test() function to calculate R and p-value:

# the two vectors
x <- mtcars$am
y <- mtcars$mpg

#calculate point-biserial correlation
cor_result <- cor.test(x, y)
cor_result$p.value
cor_result$estimate
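
As a side note, the point-biserial correlation is just Pearson's correlation with one 0/1 variable, and its test is equivalent to the pooled-variance two-sample t-test. A minimal sketch to check this on the same vectors (the names r_test and t_test are only illustrative):

```r
# the two vectors, as above
x <- mtcars$am
y <- mtcars$mpg

# Pearson correlation test (point-biserial because x is coded 0/1)
r_test <- cor.test(x, y)

# two-sample t-test with pooled variance
t_test <- t.test(y ~ x, var.equal = TRUE)

# same p-value and same |t|; the sign differs because t.test()
# computes mean(group 0) - mean(group 1)
all.equal(r_test$p.value, t_test$p.value)
all.equal(abs(unname(r_test$statistic)), abs(unname(t_test$statistic)))
```

So whatever figure is used, it is effectively showing a difference in mean mpg between the two transmission groups.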

Then I use ggplot2 to plot it this way; the numbers within the points denote the number of cylinders (cyl):

library(see) # theme_modern()
library(dplyr)
library(ggplot2)


# plot
mtcars %>% 
  mutate(am = factor(am)) %>% 
  mutate(id = row_number()) %>% 
  ggplot(aes(x=id, y=mpg, color=am, label = cyl )) +
  geom_point(size = 8, alpha=0.5)+
  geom_text(color = "black", hjust=0.5, vjust=0.5)+
  scale_color_manual(values = c("steelblue", "purple"), labels = c("No", "Yes"))+
  scale_x_continuous(breaks = 1:32, labels = 1:32)+
  scale_y_continuous(breaks= scales::pretty_breaks())+
  geom_text(aes(x = 10, y = 30,
                label = ifelse(am == 0, "R = 0.5998324, p = 0.0002850207", "")),
            color = "black",
            size = 4) +
  facet_wrap(. ~ am, 
             nrow = 1, strip.position = "bottom") +
  labs(y = "mpg", 
       color="Automatic vs Manual transmission")+
  theme_modern()+
  theme(
    aspect.ratio = 2,
    strip.background = element_blank(),
    strip.placement = "outside",
    legend.position = "bottom",
    axis.title.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    text=element_text(size=16)
  )
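
The R and p values in the annotation above are typed in by hand. A possible sketch for building the label from the test result instead, so it cannot drift from the data. The names label_text and label_df are made up for this sketch, and the rounding is an arbitrary choice:

```r
# recompute the test so this snippet runs on its own
cor_result <- cor.test(mtcars$am, mtcars$mpg)

# build the annotation text from the result
label_text <- sprintf("R = %.2f, p = %.4f",
                      cor_result$estimate, cor_result$p.value)
label_text
# "R = 0.60, p = 0.0003"

# one-row data frame so the label is drawn once, in the am = 0 facet only
label_df <- data.frame(
  am    = factor(0, levels = c(0, 1)),
  x     = 10,
  y     = 30,
  label = label_text
)

# assumed usage in the plot above, replacing the ifelse() text layer:
# geom_text(data = label_df, aes(x = x, y = y, label = label),
#           inherit.aes = FALSE, color = "black", size = 4)
```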


My question: Would you consider this an appropriate figure to show the correlation between am and mpg? Could you give me a hint on how to improve this plot?

TarJae
    Why do you use different colors for `am` values if you already make facets based on `am`? (or vice-versa) – bretauv Apr 25 '23 at 07:48

1 Answer

I don't like your plot because the jitter in the x-direction brings no useful information (the points are ordered by row id).

You've added a third dimension to the plot (number of cylinders cyl in addition to transmission am and miles per gallon mpg). I'll ignore this third dimension because the question asks how to show the association between am and mpg.

Since am takes only two values (0 = automatic, 1 = manual), this boils down to visually comparing two groups. With larger sample sizes, my default choice for this kind of comparison is overlapping histograms (example).
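
For reference, overlapping histograms can be sketched in base R roughly like this. The bin width of 2.5 mpg, the colors and the object names are arbitrary choices for illustration:

```r
# split mpg by transmission type
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# shared bins so the two histograms are comparable (mpg runs from 10.4 to 33.9)
breaks <- seq(10, 35, by = 2.5)

# compute the histograms first so the y-axis can fit both
h_auto   <- hist(auto,   breaks = breaks, plot = FALSE)
h_manual <- hist(manual, breaks = breaks, plot = FALSE)

# draw them on top of each other with semi-transparent fills
plot(h_auto,
  col  = adjustcolor("steelblue", alpha.f = 0.5),
  ylim = c(0, max(h_auto$counts, h_manual$counts)),
  xlab = "mpg",
  main = "Automatic (0) vs Manual (1) transmission"
)
plot(h_manual, col = adjustcolor("purple", alpha.f = 0.5), add = TRUE)

legend("topright",
  legend = c("automatic (0)", "manual (1)"),
  fill   = adjustcolor(c("steelblue", "purple"), alpha.f = 0.5)
)
```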

Here histograms don't work well because there are only a few observations per transmission group. In this case I prefer a stacked strip chart.

attach(mtcars)

stripchart(mpg ~ am,
  method = "stack",
  main = "Automatic (0) vs Manual (1) transmission"
)
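
A variant of the same chart that avoids attach() and labels the groups by name, as a sketch. The names am_label and group_means are made up here, and the red ticks marking the group means are an optional extra:

```r
# label the two transmission groups explicitly
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# mean mpg per group
group_means <- tapply(mtcars$mpg, am_label, mean)

stripchart(mtcars$mpg ~ am_label,
  method = "stack",
  pch    = 16,
  xlab   = "mpg",
  main   = "Automatic vs Manual transmission"
)

# short vertical tick at each group mean (groups sit at y = 1 and y = 2)
segments(
  x0  = group_means,
  y0  = seq_along(group_means) - 0.15,
  y1  = seq_along(group_means) + 0.15,
  col = "red", lwd = 2
)
```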

If you'd like more color in your strip charts, another option is a beeswarm plot.

This type of graph arranges the data so that each point is visible, but it doesn't jitter them randomly; the exact positions are calculated so that the points don't overlap yet are packed closely. (There are various algorithms for this purpose.)

In this case the difference between the strip chart and the beeswarm plot is hard to notice as there are so few points to plot. For fun I've colored the points according to cyl (cylinders).

library("beeswarm")

beeswarm(mpg ~ am,
  pch = 15,
  pwcol = cyl,
  main = "Automatic (0) vs Manual (1) transmission",
  horizontal = TRUE
)

legend("topright",
  title = "cylinders",
  legend = c(8, 6, 4),
  col = c("#9E9E9E", "#CD0BBC", "#2297E6"),
  pch = 15
)

Created on 2023-04-25 with reprex v2.0.2
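
A note on the colors: pwcol = cyl works because the cylinder counts 4, 6 and 8 are used as indices into R's current palette, and as far as I can tell the hex codes in the legend are entries 8, 6 and 4 of the default palette in R >= 4.0. A sketch that spells the mapping out instead and does not need attach(). The names cyl_cols, d and point_col are made up here, and the plot is only drawn if the beeswarm package is installed:

```r
# explicit cylinder -> color mapping (same hex codes as in the legend above)
cyl_cols <- c("4" = "#2297E6", "6" = "#CD0BBC", "8" = "#9E9E9E")

# copy of the data with one color per car
d <- mtcars
d$point_col <- unname(cyl_cols[as.character(d$cyl)])

if (requireNamespace("beeswarm", quietly = TRUE)) {
  beeswarm::beeswarm(mpg ~ am,
    data       = d,
    pch        = 15,
    pwcol      = point_col,
    main       = "Automatic (0) vs Manual (1) transmission",
    horizontal = TRUE
  )

  legend("topright",
    title  = "cylinders",
    legend = names(cyl_cols),
    col    = unname(cyl_cols),
    pch    = 15
  )
}
```

This way the plot no longer depends on the session's palette() setting.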

dipetkov