Using multiple summary statistics in a ggplot2 plot

Question

I'm analysing some house sale transaction data, and I want to produce a geographic plot with the colour indicating average price per (hex-binned) region. Some regions have limited data, and I want to indicate this by adjusting the opacity to reflect the number of points in each region.

This would require me to calculate two statistics for each hex: average price and number of points. The ggplot2 package makes it very easy to calculate and plot one statistic in a chart, but I can't figure out how to calculate two.

To illustrate the point:

library(ggplot2)

N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) # dummy data

# I want to produce a hex-binned version of this:
ggplot(data=df_demo) + geom_point(mapping=aes(x=A, y=B, color=C)) 

# It's easy to get each hex's average price *or* its point density:
ggplot(data=df_demo) + stat_summary_hex(mapping=aes(x=A,y=B,z=C), fun=mean) # color = average of C across hex, but opacity can't be adjusted
ggplot(data=df_demo) + geom_hex(mapping=aes(x=A, y=B, color=C, alpha=..ndensity..)) # opacity = normalised # of points, but color is *total* value which is wrong

I would like to combine the effects of the last two lines, but that doesn't seem to be an option: the ..ndensity.. statistic doesn't work in the context of stat_summary_hex(), and geom_hex() won't calculate the mean value.

Is there a way to do this that I'm overlooking? Alternatively, is there an obvious way of precomputing the statistics needed before constructing the plot? E.g. by determining the expected hex for each datum during my dplyr pipeline.

One hint that there may not be an easy solution is this non-CRAN package which - if I've understood correctly - solves more or less this problem. However, I'd rather not rely on out-of-CRAN code if at all possible, so I'm holding onto hope that I've missed something obvious.

score 0 · Accepted Answer · answered Feb 13 '20 at 23:53

What about a different geom, e.g. geom_tile? You can cut each dimension (A and B) into bins, pre-calculate the mean and the number of observations for each tile, and then plot like this:

library(tidyverse)

N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) %>%
  mutate(cuts_a= cut(A, breaks = 20), cuts_b= cut(B, breaks = 20)) %>%
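  group_by(cuts_a, cuts_b) %>%
  # summarise (rather than mutate) so each tile is drawn only once;
  # with one row per observation the overplotted tiles stack their alpha
  summarise(mean_c = mean(C), n_obs = n()) %>%
  ungroup()

# One tile per bin: fill = mean of C, opacity = number of observations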
ggplot(data=df_demo) + 
  geom_tile(mapping=aes(x=cuts_a, y=cuts_b, fill=mean_c, alpha = n_obs))

Created on 2020-02-13 by the reprex package (v0.3.0)
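
If hexagons are needed rather than square tiles, the same precompute-then-plot idea can probably be carried over with the hexbin package (on CRAN, and already required by ggplot2's hex stats). The following is only a sketch on the dummy data, not something tested on real geographic data. The names hb, df_hex, mean_c and n_obs and the choice of xbins = 15 are illustrative assumptions. It assigns each point to a hex cell, computes the mean of C and the count per cell, and draws the result with geom_hex(stat = "identity").

```r
library(ggplot2)
library(hexbin)

set.seed(42)  # reproducible dummy data
N <- 1000
df_demo <- data.frame(A = runif(N), B = runif(N), C = runif(N))

# Assign every point to a hexagon; IDs = TRUE keeps the per-point cell id in hb@cID
# (xbins = 15 is an arbitrary choice for this sketch)
hb <- hexbin(df_demo$A, df_demo$B, xbins = 15, IDs = TRUE)

# Mean of C per hexagon, named by cell id
mean_by_cell <- tapply(df_demo$C, hb@cID, mean)

# x/y centres of the non-empty hexagons, in the same order as hb@cell
centres <- hcell2xy(hb)

df_hex <- data.frame(
  x      = centres$x,
  y      = centres$y,
  n_obs  = hb@count,
  mean_c = as.vector(mean_by_cell[as.character(hb@cell)])
)

# fill = mean of C per hex, opacity = number of points per hex
p <- ggplot(df_hex) +
  geom_hex(aes(x = x, y = y, fill = mean_c, alpha = n_obs), stat = "identity")
p
```

With stat = "identity", geom_hex has to infer the hexagon size from the spacing of the centres, so the hexagons may not tile perfectly and the result is worth checking visually. The same df_hex could also be built inside a dplyr pipeline by adding hb@cID as a column and grouping on it.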

answered Feb 13 '20 at 23:53

tjebo

Thanks for this. Not ideal for me since I am very much looking at geographical data, which tends to be more radial around population centres - there's a reason why the Civilisation games use a hex grid! However, it's definitely the best response I've received to date (heh), so I'm going to mark this as the recommended solution. – Alex Apr 08 '22 at 14:18
@Alex There was recently a very similar question (at least I believe it's similar; I haven't re-read your question very carefully): https://stackoverflow.com/questions/71041500/2d-summary-plot-with-counts-as-labels/71165587#71165587 Maybe it can help with that problem. – tjebo Apr 09 '22 at 12:12
