1

I need to look for correlations in the publicly available flights data set (from the nycflights13 package). I managed to make a scatter plot using ggplot with the code:

library(nycflights13)
library(ggplot2)  # needed for ggplot(), geom_point() and geom_smooth()
attach(flights)
ggplot(flights, aes(x = arr_delay, y = dep_delay)) + 
  geom_point(size = 2) +
  geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)

As the plot shows, most points are concentrated in the bottom left. Is there any way to make this graph more visually appealing by spreading the plotted values better?

SomeDutchGuy
  • 2
    as an aside: [Why is it not advisable to use `attach()` in R, and what should I use instead?](https://stackoverflow.com/questions/10067680/why-is-it-not-advisable-to-use-attach-in-r-and-what-should-i-use-instead) – markus Mar 07 '20 at 11:05
  • ;) the inevitable [attach bashing](https://www.r-bloggers.com/to-attach-or-not-attach-that-is-the-question/) beginneth ... even more so since the example works totally fine without the `attach` line ;) - But on a serious note regarding the actual question: Not sure if it would really **improve** readability, but you could use `ggplot2::scale_x_continuous` `trans` argument to use logarithmic scales for x and y axis... i.e. `+ scale_x_continuous(trans='log10')` – dario Mar 07 '20 at 11:09
  • 1
    @dario Yeah, someone had to bring up that post ;). Regarding the question, the data contains negative values so log won't work. – markus Mar 07 '20 at 11:12
  • 1
    @markus Thanks for the comment, I didn't even check if there are values <= 0 in the data.. log transform won't work for obvious reasons ;) I think your [`geom_hex`](https://ggplot2.tidyverse.org/reference/geom_hex.html) link is broken otherwise interesting and good suggestion – dario Mar 07 '20 at 11:33
  • Since the variables are called "_delay", you could argue that any negative values should be omitted. Then you can take logs (after adding a small number first). As another aside, always label your axes correctly. What are the units? Minutes? Hours? Days? – Edward Mar 07 '20 at 11:43

3 Answers

3

You can plot your points with the alpha parameter, which sets their degree of transparency (a value between 0 and 1, where 0 is fully transparent and 1 is fully opaque). This makes overlapping points easier to distinguish, and regions of the plot with a higher concentration of points look darker. The style of the plot will improve, too.

Start with a value of alpha = 0.7 then experiment with it until you get the best results.

ggplot(flights, aes(x = arr_delay, y = dep_delay)) + 
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
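
As a rough guide for choosing alpha, here is a minimal sketch of how opacity builds up when points overlap. It assumes standard "over" alpha compositing, in which n identical points drawn on top of each other reach a combined opacity of 1 - (1 - alpha)^n. The helper name stacked_opacity is hypothetical and is not part of ggplot2.

```r
# Combined opacity of n points drawn on top of each other, each with the
# same alpha. Assumption: standard "over" alpha compositing.
# stacked_opacity is a hypothetical helper, not a ggplot2 function.
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

overlaps <- c(1, 2, 5, 10, 50)
opacity_table <- data.frame(
  n_points   = overlaps,
  alpha_0.7  = round(stacked_opacity(0.7, overlaps), 3),
  alpha_0.05 = round(stacked_opacity(0.05, overlaps), 3)
)
print(opacity_table)
```

Under that assumption, two overlapping points at alpha = 0.7 are already about 91% opaque, so with a data set this large a much smaller value is likely needed before dense regions become distinguishable from sparse ones.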
user2332849
3

I used facets to separate the flights that arrived early or on time (arr_delay <= 0) from those that arrived late (arr_delay > 0). The relationship seems to differ between the two groups.

library(nycflights13)
library(dplyr)
library(ggplot2)

ff <- flights %>%
  filter(!is.na(arr_delay), origin=="LGA") %>%  # Filtered to reduce waiting time!
  mutate(`Arrival time`=ifelse(arr_delay<=0, "Early", "Delayed"))

ggplot(ff, aes(x = arr_delay, y = dep_delay)) + 
       geom_point(size = 2, alpha = 0.3) +
       geom_smooth(method="auto", fullrange=FALSE, level=0.95) + 
       facet_wrap(~`Arrival time`, scales="free", labeller=label_both) +
       labs(x="Arrival delay (minutes)", y="Departure delay (minutes)")


Edward
1

You could plot aggregated data for the points and fit the smooth to the original, unaggregated data.

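library("ggplot2")
library("nycflights13")  # load before flights is first used

# Group flights into 10-minute departure-delay bins and store each bin's mean delays
flights <- within(flights, {
  bin <- floor(dep_delay / 10)
  # na.rm = TRUE so a single missing value does not turn a whole bin's mean into NA
  av_arr <- ave(arr_delay, bin, FUN = function(x) mean(x, na.rm = TRUE))
  av_dep <- ave(dep_delay, bin, FUN = function(x) mean(x, na.rm = TRUE))
})

# Points show the bin means; the smooth is fitted to the raw observations
ggplot(flights) + 
  geom_point(aes(x=av_arr, y=av_dep), size=2) +
  geom_smooth(aes(x=arr_delay, y=dep_delay), method="auto", se=TRUE, 
              fullrange=FALSE, level=0.95)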
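
To see what the binning step does, here is a small sketch on made-up numbers (the toy data frame is purely illustrative and is not taken from nycflights13). Every row receives the mean of its 10-minute departure-delay bin, so all rows in a bin plot on top of each other as a single point.

```r
# Illustrative toy data, not real flight records
toy <- data.frame(
  dep_delay = c(3, 7, 12, 18, 25),
  arr_delay = c(1, 5, 10, 20, 30)
)

toy <- within(toy, {
  bin <- floor(dep_delay / 10)               # 0, 0, 1, 1, 2
  av_arr <- ave(arr_delay, bin, FUN = mean)  # per-bin mean, repeated for each row
  av_dep <- ave(dep_delay, bin, FUN = mean)
})

print(toy)
```

Here the five rows collapse to three distinct (av_arr, av_dep) pairs, one per bin, which is why the aggregated scatter plot is so much sparser than the original.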


jay.sf