
I can plot geom_freqpoly without problems when it counts the observations itself:

ggplot(data=demo) +
 geom_freqpoly(mapping=aes(x = value))


But I'd like to use the precalculated observation count contained in the data instead.

I tried using stat = "identity", but it does not produce the expected plot.

ggplot(data=demo) +
 geom_freqpoly(mapping=aes(x = value, y = cnt), stat = "identity")


This is my sample data

demo  <- tribble(
 ~value,    ~cnt,
 .25, 20,
 .25, 30,
 .1, 40
)
Marmite Bomber

2 Answers


TL;DR: You didn't get the graph you want, because the pre-calculated counts you passed to ggplot look NOTHING like the binned data that geom_freqpoly computes internally to draw the first graph.

Since you didn't include code for the original demo used to generate graph 1, I'll venture a guess:

demo.orig <- data.frame(value = c(0.25, 0.25, 0.1))

p <- ggplot(demo.orig, aes(x = value)) +
  geom_freqpoly()
p # show plot to verify its appearance, which matches the graph in the question
layer_data(p) # look at the calculated data used by geom_freqpoly

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
   y count          x       xmin       xmax       width   density ncount ndensity PANEL group colour size linetype alpha
1  0     0 0.09310345 0.09051724 0.09568966 0.005172414   0.00000    0.0      0.0     1    -1  black  0.5        1    NA
2  1     1 0.09827586 0.09568966 0.10086207 0.005172414  64.44444    0.5      0.5     1    -1  black  0.5        1    NA
3  0     0 0.10344828 0.10086207 0.10603448 0.005172414   0.00000    0.0      0.0     1    -1  black  0.5        1    NA
... (omitted to conserve space)
30 0     0 0.24310345 0.24051724 0.24568966 0.005172414   0.00000    0.0      0.0     1    -1  black  0.5        1    NA
31 2     2 0.24827586 0.24568966 0.25086207 0.005172414 128.88889    1.0      1.0     1    -1  black  0.5        1    NA
32 0     0 0.25344828 0.25086207 0.25603448 0.005172414   0.00000    0.0      0.0     1    -1  black  0.5        1    NA

From a small dataframe with only two unique values, stat_bin generated a much larger dataframe with the x-axis split into 30 bins (the default number) plus one empty padding bin at each end, and count / y = 0 everywhere except for the two bins containing the original values.

> geom_freqpoly
function (mapping = NULL, data = NULL, stat = "bin", position = "identity", 
    ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 
{
    params <- list(na.rm = na.rm, ...)
    if (identical(stat, "bin")) {
        params$pad <- TRUE
    }
    layer(data = data, mapping = mapping, stat = stat, geom = GeomPath, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = params)
}

A quick check by printing geom_freqpoly to console shows that its underlying geom is simply GeomPath, which plots x/y pairs in sequential order.
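To make this concrete, here is a minimal sketch using the question's sample data (rebuilt with `data.frame` instead of `tribble`). The object names `p_identity`, `p_path`, `id_xy` and `path_xy` are only illustrative. With `stat = "identity"` no binning happens, so the layer should receive just the three original x/y pairs, the same ones a plain `geom_path()` would connect in row order.

```r
library(ggplot2)

# Sample data from the question (precalculated counts)
demo <- data.frame(value = c(0.25, 0.25, 0.1), cnt = c(20, 30, 40))

# geom_freqpoly with the binning stat switched off
p_identity <- ggplot(demo, aes(x = value, y = cnt)) +
  geom_freqpoly(stat = "identity")

# a plain path through the same points
p_path <- ggplot(demo, aes(x = value, y = cnt)) +
  geom_path()

# Compare the x/y pairs that each layer will draw
id_xy <- layer_data(p_identity)[, c("x", "y")]
path_xy <- layer_data(p_path)[, c("x", "y")]
print(id_xy)
print(all.equal(id_xy, path_xy))
```

If the two layers agree, that explains the second graph in the question: it is a line through three points, with no rows telling the path where to drop back to 0.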

In other words, if you want to get the peaks from graph 1, you need to provide a similar dataset, with rows indicating where y should drop to 0. While it's certainly possible to calculate this by digging into the code for StatBin$compute_group, I'd think it's simpler to expand from the data of pre-calculated counts and let ggplot do its normal job:

demo %>%
  tidyr::uncount(cnt) %>%
  ggplot(aes(x = value)) + 
  geom_freqpoly() +
  theme_minimal()

plot
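As a rough illustration of what the expansion step does (and why it may not scale to very large counts), the sketch below applies `tidyr::uncount` to the question's sample data on its own. The name `expanded` is just illustrative.

```r
library(tidyr)

# Sample data from the question (precalculated counts)
demo <- data.frame(value = c(0.25, 0.25, 0.1), cnt = c(20, 30, 40))

# uncount() repeats each row `cnt` times and drops the cnt column
expanded <- uncount(demo, cnt)

nrow(expanded)         # one row per original observation, i.e. sum(demo$cnt)
table(expanded$value)  # observations per distinct value
```

The expanded dataframe has as many rows as the total count, so memory use grows with the counts rather than with the number of distinct values.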


Edit: solution without fully expanding dataframe of aggregated counts

Sample dataset with 2 groups:

demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
                   cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
                   group = rep(c("a", "b"), each = 3))

Code:

library(ggplot2)
library(dplyr)

demo %>%
  rename(x = value, y = cnt) %>% # rename here so approach below can be easily applied
                                 # to other datasets with different column names
  tidyr::nest(data = c(x, y)) %>% # nest to apply same approach for each group

  mutate(data = purrr::map(
    data,
    function(d) ggplot2:::bin_vector( # cut x's range into appropriate bins
      x = d$x,
      bins = ggplot2:::bin_breaks_bins(
        x_range = range(d$x),
        bins = 30), # default bin count is 30; change if desired
      pad = TRUE) %>%
      select(x, xmin, xmax) %>%

      # place y counts into the corresponding x bins (this is probably similar
      # to interval join, but I don't have that package installed on my machine)
      tidyr::crossing(d %>% rename(x2 = x)) %>%
      mutate(y = ifelse(x2 >= xmin & x2 < xmax, y, 0)) %>%
      select(-x2) %>%
      group_by(x) %>%
      filter(y == max(y)) %>%
      ungroup() %>%
      unique())) %>%

  tidyr::unnest(cols = c(data)) %>% # unnest to get one flat dataframe back

  ggplot(aes(x = x, y = y, colour = group)) + # plot as per normal
  geom_path() +
  theme_bw()

# package versions used: dplyr 1.0.0, ggplot2 3.3.1, tidyr 1.1.0, purrr 0.3.4

result

Z.Lin
  • This gives the expected result, but essentially by transforming the aggregated data back to the single-observation level (which I intended to avoid). Still, this answer deserves an upvote, not least because of the mention of `layer_data` (which I didn't know about and had no idea how to find). Would you mind providing some information on `compute_group`, so I can accept the answer? Please rerun the sample data with a count of `40M` instead of `40` to see what I mean. – Marmite Bomber Jul 19 '20 at 09:01
  • Sorry for the late response; I somehow didn't receive a notification about the update. *Many thanks for the detailed explanation* with the sample code. I think the mention of `ggplot2:::bin_vector` is crucial because of its `weight` parameter, which led me to the solution. Would you mind reviewing my alternative answer, so I can accept it if you find it complete? – Marmite Bomber Aug 03 '20 at 13:30

Based on a similar problem for histograms, the solution seems to be as simple as mapping the precalculated count to the weight aesthetic.

The solution using the sample data from the other answer would be

demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
                   cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
                   group = rep(c("a", "b"), each = 3))


ggplot(demo, aes(value, weight = cnt, color = group)) + geom_freqpoly()  

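One way to sanity-check this, sketched here on the question's small sample data, is to compare the binned counts from the weighted plot with those from the fully expanded data. The names `p_weight`, `p_expand`, `w` and `e` are illustrative, and `bins = 30` is set explicitly only to match the default and silence the binning message.

```r
library(ggplot2)
library(tidyr)

# Sample data from the question (precalculated counts)
demo <- data.frame(value = c(0.25, 0.25, 0.1), cnt = c(20, 30, 40))

# Aggregated data, counts supplied through the weight aesthetic
p_weight <- ggplot(demo, aes(x = value, weight = cnt)) +
  geom_freqpoly(bins = 30)

# Same data expanded back to one row per observation
p_expand <- ggplot(uncount(demo, cnt), aes(x = value)) +
  geom_freqpoly(bins = 30)

# Compare the per-bin counts computed by stat_bin for both plots
w <- layer_data(p_weight)
e <- layer_data(p_expand)
print(all.equal(w$count, e$count))
```

If the comparison holds, the weighted version draws the same polygon without ever materialising the individual observations, which is what matters for counts in the millions.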

Marmite Bomber