I'm sure this can be done by separately collecting all the data and then just using ggplot for the plotting, but I'd really prefer a simpler solution within ggplot, particularly stat_ecdf(), because of easier access to grouping variables, facets, etc.

My dataframe contains, amongst others, two columns of corresponding data x and y. I'd like to plot the ecdf of y on an axis of the corresponding x values. In other words, I'd like to plot what cumulative portion of the y variable is reached at its corresponding x value. While x and y are correlated (both descending), they are not analytically connected, so I cannot simply scale values of y to x. My attempts to do this with separate calculations of the ecdf functions of each subset have gotten extremely messy and complicated, while the stat_ecdf function seems to be very close to getting me what I need.

If I set the x variable in the ggplot aes to x and then set the variable within stat_ecdf to y, I am able to get the ecdf of y with axis labels of x; however, the actual values on the axis correspond to x. This is done with something like:

ggplot(df, aes(x, color=group_var)) + stat_ecdf(aes(y))

EDIT: To visualize this: This sample plot shows the ecdf of x for multiple groups. Each x value has a corresponding y value in a sorted dataframe (approximate relationship; ignore the decreasing regions at the end). I would like to have a similar plot where the horizontal axis is in the corresponding y values. Basically, I need to map the horizontal axis of the first ecdf plot from x to y as simply as possible. I could do this manually by adding ecdf values as a column in the dataframe, but I am looking to do it within ggplot for simplicity, if possible.

  • I'm not sure I understand what you want to achieve. ecdf is by definition calculated from a single variable. If you can describe the desired transformation clearly, it will probably be easy to deal with the 'separate calculations' here. Are you trying to plot `x` versus the `quantile` of `y`? – liborm Aug 12 '22 at 21:08
  • I'm not trying to plot against the quantile. I want to plot the ecdf of one variable on an axis of corresponding values of another. For example, if I have columns of corresponding height and weight, with both guaranteed to be descending, I would like to plot the ecdf of weight on an axis of height (to see what portion of the total weight is made up by people of a certain height or below). – rockets_go_boom Aug 12 '22 at 21:22
  • Edited to give example – rockets_go_boom Aug 12 '22 at 21:33
  • Welcome to SO! It would be easier to help you if you provide [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data. This said: One issue with your code is that you should do `stat_ecdf(aes(y = y))` to get the ecdf of y versus x. – stefan Aug 12 '22 at 21:38
  • Please provide enough code so others can better understand or reproduce the problem. – Community Aug 13 '22 at 00:12
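
The height/weight comment above can be read in two ways: as the ECDF of weight, or as the cumulative share of total weight among people up to a given height. These are different quantities. A minimal base R sketch with made-up data (the `people` data frame and its `height` and `weight` columns are hypothetical, not from the question) shows both side by side:

```r
# Made-up data: hypothetical columns, not taken from the original question.
people <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 70, 80)
)

# Sort by the axis variable, then accumulate the other variable.
people <- people[order(people$height), ]

# Cumulative share of total weight held by people of this height or below.
people$weight_share <- cumsum(people$weight) / sum(people$weight)

# ECDF of weight evaluated at each observed weight, for comparison:
# the fraction of observations with weight less than or equal to this one.
people$weight_ecdf <- ecdf(people$weight)(people$weight)

print(people)
```

If the cumulative share is what is wanted, `cumsum()` within each group is probably a better fit than any ECDF-based approach.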

1 Answer

Instead of trying to bend stat_ecdf to do something it was not designed for, it's better to be explicit about your intention in the code.

It's quite straightforward. The weirdest piece of code, ecdf(y)(y), means 'calculate the empirical CDF for y, and then evaluate it at the actual values of y in my data'. The cummax deals with the decreasing y, to get an ever-increasing eCDF along x.

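# d_sample is the fake dataset built under 'Sample data' below
d_sample %>%
  group_by(group) %>%
  arrange(group, x) %>%
  mutate(
    # eCDF of y within each group, evaluated at the observed y values
    fraction = ecdf(y)(y),
    # running maximum, so the curve never decreases along x
    maxf = cummax(fraction)) %>%
  ggplot(aes(x, maxf)) +
  geom_point() +
  facet_wrap(~group)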

[Figure: eCDF of the sample data]
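
To make the two steps less mysterious, here is a minimal base R sketch on a toy vector (the values are made up and stand in for one group's y, already ordered by x):

```r
# Toy values standing in for one group's y, already ordered by x.
y <- c(3, 5, 4, 2)

# ecdf(y) returns a step function; calling it on y gives, for each value,
# the fraction of observations less than or equal to that value.
fraction <- ecdf(y)(y)

# cummax() keeps the running maximum, so the result never decreases along x.
maxf <- cummax(fraction)

print(fraction)  # 0.50 1.00 0.75 0.25
print(maxf)      # 0.50 1.00 1.00 1.00
```

Once y starts to fall, `fraction` falls with it, while `maxf` stays flat at the highest value reached so far.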

I'm still not really sure if that's what you need.

Sample data

To be honest it took me most of the time to 'fake' your dataset:

library(tidyverse)

tibble(x = seq_len(300) + 100) %>%
  mutate(
    one = - 1e-3 * (x * x) + 50 + 0.7 * x,
    two = - 1e-3 * (x * x) + 55 + 0.68 * x,
    three = - 1e-3 * (x * x) + 110 + 0.5 * x,
    four = - 1e-3 * (x * x) + 10 + 0.8 * x) %>%
  pivot_longer(-x, names_to = "group", values_to = "y") %>%
  filter(
    group == "one"
    | group == "two"
    | (group == "three" & x < 200)
    | (group == "four" & x > 250)) ->
  d_sample

d_sample %>%
  ggplot(aes(x, y, colour = group)) +
  geom_point()

[Figure: sample data scatter plot]

liborm