
My purpose is to reproduce this figure (Douglas A. Lind, William G Marchal, Samuel A. Wathen, Statistical Techniques in Business and Economics, McGraw-Hill, 17th edition) with ggplot2 (author: Hadley Wickham).

[Figure: the textbook dot plot to be reproduced]

Here is my effort based on geom_point and some ugly data preparation (see code further down):

[Figure: my geom_point() version]

How could I do that with geom_dotplot()?

In my attempts I have encountered several problems: (1) map the default density produced by geom_dotplot to a count, (2) cut off the axis, (3) not have unexpected holes. I gave up and hacked geom_point() instead.

I expected (and still hope) it would be as simple as

ggplot(data, aes(x,y)) + geom_dotplot(stat = "identity")

but no. So here's what I've tried and the output:

# Data
df <- structure(list(x = c(79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105), y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10, 13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)), class = "data.frame", row.names = c(NA, -27L))

# dotplot based on geom_dotplot
geom_dots <- function(x, count, round = 10, breaks = NULL, ...) {
    require(ggplot2)
    n = sum(count) # total number of dots to be drawn
    b = round*round(n/round) # prettify breaks
    x = rep(x, count) # make x coordinates for dots
    if (is.null(breaks))  breaks = seq(0, 1, b/4/n)
    ggplot(data.frame(x = x), aes(x = x)) +
        geom_dotplot(method = "histodot", ...) +
        scale_y_continuous(breaks = breaks, 
                        #limits = c(0, max(count)+1), # doesn't work
                        labels = breaks * n) 
} 

geom_dots(x = df$x, count = df$y) 

# dotplot based on geom_point
ggplot_dot <- function(x, count, ...) {
    require(ggplot2)
    message("The count variable must be an integer")
    count = as.integer(count) # make sure these are counts
    n = sum(count) # total number of dots to be drawn
    x = rep(x, count) # make x coordinates for dots
    count = count[count > 0]  # drop zero cases 
    y = integer(0)  # initialize y coordinates for dots
    for (i in seq_along(count)) 
        y <- c(y, 1:(count[i]))  # compute y coordinates
    ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
        geom_point(...)  # draw one dot per positive count
}

ggplot_dot(x = df$x, count = df$y, 
    size = 11, shape = 21, fill = "orange", color = "black") + theme_gray(base_size = 18)
# ggsave("dotplot.png") 
ggsave("dotplot.png", width = 12, height = 5.9)

Brief random comment: With the geom_point() solution, saving the plot involves tweaking the sizes just right to ensure that the dots are in contact (both the dot size and the plot height/width). With the geom_dotplot() solution, I rounded the labels to make them prettier. Unfortunately I was not able to cut off the axis at about 100: using limits() or coord_cartesian() results in a rescaling of the entire plot and not a cut. Note also that to use geom_dotplot() I created a vector of data based on the count, as I was unable to use the count variable directly (I expected stat="identity" to do that, but I couldn't make it work).
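The loop that builds the y coordinates in ggplot_dot() can also be written as a single vectorised call. This is a minimal sketch, assuming the same x values and counts as in df above; the name dots and the use of base R's sequence() are illustrative choices, not part of the original code.

```r
library(ggplot2)

# Same data as df in the question: x values and their counts
df <- data.frame(
  x = 79:105,
  y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10,
        13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)
)

# One row per dot: x repeated `count` times, y running from 1 to count
# within each x. Zero counts contribute no rows in either vector.
dots <- data.frame(
  x = rep(df$x, df$y),
  y = sequence(df$y)
)

# Same appearance settings as in the ggplot_dot() call above
p <- ggplot(dots, aes(x = x, y = y)) +
  geom_point(size = 11, shape = 21, fill = "orange", colour = "black")
```

Calling print(p) draws the plot. The dot size still has to be tuned against the saved width and height, exactly as with the loop version.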


PatrickT
  • I wonder how much work it would be to extend ``stat_count`` and add a ``geom="dot"`` to print circles instead of rectangles... One could then imagine stacking squares, triangles, images... – PatrickT Dec 10 '18 at 05:58

3 Answers


Coincidentally, I've also spent the past day fighting with geom_dotplot() and trying to make it show a count. I haven't figured out a way to make the y axis show actual numbers, but I have found a way to truncate the y axis. As you mentioned, coord_cartesian() and limits don't work, but coord_fixed() does, since it enforces a ratio of x:y units:

library(tidyverse)
df <- structure(list(x = c(79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105), y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10, 13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)), class = "data.frame", row.names = c(NA, -27L))
df <- tidyr::uncount(df, y) 

ggplot(df, aes(x)) +
  geom_dotplot(method = 'histodot', binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL) + 
  # Make this as high as the tallest column
  coord_fixed(ratio = 15)

Using 15 as the ratio here works because the x-axis is also in the same units (i.e. single integers). If the x-axis is a percentage or log dollars or date or whatever, you have to tinker with the ratio until the y-axis is truncated enough.
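One way to avoid hard-coding the ratio is to derive it from the data. The sketch below rests on an assumption of mine rather than on anything stated in this answer: with the default dotsize and stackratio of 1, each dot is one binwidth across, so the tallest stack is roughly (largest count × binwidth) x-units high, and that is the value coord_fixed() needs. It reproduces 15 for this data; for other axes, treat the result as a starting point for the tinkering described above. The names obs, tallest and ratio are illustrative.

```r
library(ggplot2)

# Same counts as df above, expanded to one row per observation
df <- data.frame(
  x = 79:105,
  y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10,
        13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)
)
obs <- data.frame(x = rep(df$x, df$y))

binwidth <- 1
tallest <- max(table(obs$x))  # number of dots in the tallest column
ratio <- tallest * binwidth   # assumed rule of thumb, see the text above

p <- ggplot(obs, aes(x)) +
  geom_dotplot(method = "histodot", binwidth = binwidth) +
  scale_y_continuous(NULL, breaks = NULL) +
  coord_fixed(ratio = ratio)
```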


Edited with method for combining plots

As I mentioned in a comment below, using patchwork to combine plots with coord_fixed() doesn't work well. However, if you manually set the heights (or widths) of the combined plots to the same values as the ratio in coord_fixed() and ensure that each plot has the same x axis, you can get pseudo-faceted plots:

# Make a subset of df
df2 <- df %>% slice(1:25)

plot1 <- ggplot(df, aes(x)) +
  geom_dotplot(method = 'histodot', binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL) + 
  # Make this as high as the tallest column
  # Make xlim the same on both plots
  coord_fixed(ratio = 15, xlim = c(75, 110))

plot2 <- ggplot(df2, aes(x)) +
  geom_dotplot(method = 'histodot', binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL) + 
  coord_fixed(ratio = 7, xlim = c(75, 110))

# Combine both plots in a single column, with each sized incorrectly
library(patchwork)
plot1 + plot2 +
  plot_layout(ncol = 1)

# Combine both plots in a single column, with each sized appropriately
library(patchwork)
plot1 + plot2 +
  plot_layout(ncol = 1, heights = c(15, 7) / (15 + 7))
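The two ratios (15 and 7) and the matching heights can be computed rather than typed in. A minimal base R sketch, assuming integer x values and binwidth = 1 as above; obs, obs2, ratios and heights are illustrative names.

```r
# Same counts as df above
df <- data.frame(
  x = 79:105,
  y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10,
        13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)
)

obs  <- data.frame(x = rep(df$x, df$y))  # one row per observation
obs2 <- obs[1:25, , drop = FALSE]        # first 25 rows, like slice(1:25)

# Tallest column in each data set; these are the coord_fixed() ratios
ratios <- c(max(table(obs$x)), max(table(obs2$x)))

# Relative panel heights, in the form passed to plot_layout(heights = ...)
heights <- ratios / sum(ratios)
```

Here ratios comes out as 15 and 7, so heights matches the c(15, 7) / (15 + 7) used above.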

Andrew
  • The downside to this is that it doesn't work well with things like patchwork or gridExtra, if you want to combine multiple plots. It also definitely doesn't work with faceting, since each facet would technically need its own ratio – Andrew Dec 11 '18 at 18:11
  • Very nice, thanks Andrew. I hadn't thought of ``coord_fixed()``, which I've used before for other purposes. That's a great one to know. Together with Nate's ``binwidth = 1`` your ``coord_fixed(ratio = 15)`` answers my original question. As for the labels, well they are redundant and if really desired a hack of ``geom_point`` like I suggested can be made to work. Difficult to choose an answer as they are both equally relevant. :-) – PatrickT Dec 11 '18 at 20:32
  • Just edited the answer to show how to combine multiple `coord_fixed()`-based plots together using appropriate sizing – Andrew Dec 14 '18 at 22:13

You can mimic the geom_dotplot with another geom - I chose ggforce::geom_ellipse for full size control of your points. It shows the count on the y axis. I have added some lines to make it more programmatic - and tried to reproduce the OP's desired graphic. This thread is related to this question, where the aim was to create animated histograms with dots.

This is the final result (see code below):

How to get there: First some necessary data modifications

library(tidyverse)
library(ggforce)

df <- structure(list(x = c(79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105), y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10, 13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)), class = "data.frame", row.names = c(NA, -27L))

bin_width <- 1
pt_width <- bin_width / 3 # so that they don't touch horizontally
pt_height <- bin_width / 2 # 2 so that they will touch vertically

count_data <- 
  data.frame(x = rep(df$x, df$y)) %>%
  mutate(x = plyr::round_any(x, bin_width)) %>%
  group_by(x) %>%
  mutate(y = seq_along(x))

ggplot(count_data) +
  geom_ellipse(aes(
    x0 = x,
    y0 = y,
    a = pt_width / bin_width,
    b = pt_height / bin_width,
    angle = 0
  )) +
  coord_equal((1 / pt_height) * pt_width)  # to make the dots round

Setting bin width is flexible!

bin_width <- 2 
# etc (same code as above)
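One caveat when changing bin_width: round-based binning (base round(), which is also the default rounding function in plyr::round_any()) rounds halves to the nearest even number, so with bin_width = 2 and integer x the bins alternately catch three values and one value. If evenly sized bins matter, flooring is one alternative. This is a sketch in base R, not the code used for the figures above; note that the bins are labelled by their left edge here, not their centre.

```r
# Same counts as df above
df <- data.frame(
  x = 79:105,
  y = c(1, 0, 0, 2, 1, 2, 7, 3, 7, 9, 11, 12, 15, 8, 10,
        13, 11, 8, 9, 2, 3, 2, 1, 3, 0, 1, 1)
)

bin_width <- 2
x_long <- rep(df$x, df$y)  # one value per observation

# floor() gives equal bins [78, 80), [80, 82), ... labelled by left edge
bin <- floor(x_long / bin_width) * bin_width

count_data <- data.frame(
  x = bin,
  y = ave(bin, bin, FUN = seq_along)  # running index within each bin
)

# For comparison, rounding sends 79, 80 and 81 to 80 but only 82 to 82
rounded <- round(c(79, 80, 81, 82) / bin_width) * bin_width
# rounded is 80 80 80 82
```

To centre the ellipses on their bins, add bin_width / 2 to x before plotting.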

Now, it was actually quite fun to reproduce the Lind-Marchal-Wathen graphic in a bit more detail. A lot of it is not possible without some hacks, most notably the "cross" axis ticks and of course the background gradient (Baptiste helped).

library(tidyverse)
library(grid)
library(ggforce)

p <- 
  ggplot(count_data) +
    annotate(x= seq(80,104,4), y = -Inf, geom = 'text', label = '|') +
  geom_ellipse(aes(
    x0 = x,
    y0 = y,
    a = pt_width / bin_width,
    b = pt_height / bin_width,
    angle = 0
  ),
  fill = "#E67D62",
  size = 0
  ) +
    scale_x_continuous(breaks = seq(80,104,4)) +
    scale_y_continuous(expand = c(0,0.1)) +
  theme_void() +
  theme(axis.line.x = element_line(color = "black"),
        axis.text.x = element_text(color = "black", 
                                   margin = margin(8,0,0,0, unit = 'pt'))) +
  coord_equal((1 / pt_height) * pt_width, clip = 'off')

oranges <- c("#FEEAA9", "#FFFBE1")
g <- rasterGrob(oranges, width = unit(1, "npc"), height = unit(0.7, "npc"), interpolate = TRUE)

grid.newpage()
grid.draw(g)
print(p, newpage = FALSE)

Created on 2020-05-01 by the reprex package (v0.3.0)

tjebo

Is this close enough for the reproduction?

[Figure: geom_dotplot() output]

To get there, since the first plot is really a histogram, expand your example data from count summaries back into one row per observation.

df <- tidyr::uncount(df, y)  

Then use method = 'histodot' and binwidth = 1 to get geom_dotplot() into its histogram-like form.

Finally, remove the y-axis for aesthetics, because its labels are fractional gibberish and even the docs say it "isn't really meaningful, so hide it".

ggplot(df, aes(x)) +
  geom_dotplot(method = 'histodot', binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL)
Nate
  • Thanks Nate. You are correct that this dotplot is a barchart of the count with circles instead of rectangles (in this example it's a simple barchart, but this could be generalized to a histogram). But the y axis is not gibberish, it's intended to be a count of the data. However, I do agree that the labels are redundant (unlike in the barchart where they are informative). Do you have a trick to get the labels right? Thanks. – PatrickT Dec 10 '18 at 05:44
  • One thing that's missing from your solution is cutting off the y axis at the top. That and the labels are the main reason I used a ``geom_point()``. Do you know how to truncate the excess axis (using ``geom_dotplot()`` and without accessing the grobs, I should add)? – PatrickT Dec 10 '18 at 05:48
  • Here is a ``geom_dotplot()`` trick I was missing: the option ``binwidth = 1`` fixes the weird gap I had. And here is a ``tidyr`` trick I learned (a transferable skill, so another big thank you!): ``uncount(df, y)``. Neat. – PatrickT Dec 10 '18 at 05:50
  • To clarify, in terms of the points listed in my OP, your solution solves (3), with ``binwidth=1``, but leaves (1) and (2) unanswered. – PatrickT Dec 10 '18 at 06:00
  • I hope you didn't think I was referring to your axis idea as gibberish, but instead the axis generated by `geom_dotplot`. Since the dotplot is using a geom with fixed size, getting your axis limits to sync with counts will always be graphics device dependent (like you have already encountered). I don't have any other tricks to get around that other than dancing with the dimensions of the plot output to line up the numbers with something like `+ scale_y_continuous(breaks = seq(0, 1, by = 1/max(df$y)), labels = 0:max(df$y))` – Nate Dec 11 '18 at 13:50
  • Thanks for getting back Nate. The original dotplot does not actually label the axis! Labels are kind of useless, you're right. Your suggestion for tweaking the axis labels does not work (the one in the comment above, it gives approx 4 instead of 7, 8.5 instead of 15), but a variant of it might. But how about clipping off the top of the axis (above 15), any idea? I think we can accept the principle that labels are useless, but having half of the figure display for nothing is harder to justify... – PatrickT Dec 11 '18 at 14:54