Cumulative distribution function of count as ratio for subset of data

Question

I am trying to plot CDFs from multiple data sets on one plot for a subset range. I subset them because the values can be very large and I do not want a large x-axis range. Regardless of the subset range, the CDF always has a y-axis range from 0 to 1. As data exists outside of the subset range, the CDF should never reach 1 within it, but rather a slightly smaller ratio. How might I go about computing CDFs for the whole distribution, then subsetting them in the plot?

This code plots CDFs, however they do not respect that there exists data beyond the x-axis range. At or around x=50, y=1, which is impossible. I have tried a few other stat_ecdf options (commented # below) with no success.

library(moments)
library(ggplot2)
library(plyr)
library(dplyr)
library(reshape2)
library(RColorBrewer)
library(cowplot)
library(scales)
library(gridExtra)
require(data.table)
require(grid)

set.seed(8)

dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))

d1_bind = bind_rows(dat1,dat2,dat3,dat4)

md1 <- melt(d1_bind)
colnames(md1) <- c("Dat","Value")
summary(md1)

ggplot(md1, aes(x = Value, color=Dat, linetype=Dat)) +
           stat_ecdf(aes(color = Dat),
#           pad = TRUE, # this does not plot correctly
#           n = 38850, # this or set to NULL does not plot correctly
           geom = "line", size = 1) +
           scale_linetype_manual(values=c("solid", "solid", "solid", "solid")) +
           scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
           scale_x_continuous(limits = c(25, 50)) +
#           scale_x_discrete(breaks = 26:451) + # this does not plot correctly
           scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))

quit()

Using stat_bin and manually computing the cumulative sum results in the same plot as the stat_ecdf call above.

ggplot(md1, aes(x = Value, color=Dat, linetype=Dat)) +
  stat_bin(aes(y = cumsum(..count..)/sum(..count..)),
  geom = "line", size = 1) +
  scale_linetype_manual(values=c("solid", "solid", "solid", "solid")) +
  scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
  scale_x_continuous(limits = c(25, 50)) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))

In your case, the ECDF should reach 1 above certain values: 300 in `dat1`, 350 in `dat2`, 400 in `dat3`, and 450 in `dat4`. You do not observe values greater than those in each respective data frame, so the ECDF should be 1 above those values. You specify CDF, but use `stat_ecdf` in your code. Are you trying to plot the theoretical CDF of some function? — LMc, May 06 '21 at 21:15
Also please set a seed for your data since you're randomly sampling so we have the same sampled data you have. — LMc, May 06 '21 at 21:15
Further, your random data is `rbind`'d, but because all four have different column names, you end up with all rows having one non-`NA` values and 3 `NA`s. Is that intentional? (After reshaping, I see 37.5% `NA` in `md1`.) — r2evans, May 06 '21 at 21:19
(FYI, don't use `require` like that, see https://stackoverflow.com/a/51263513/3358272. If you want to use it, fine, but do something with its return value.) — r2evans, May 06 '21 at 21:23
I use rbind as I'm not familiar with any other way to put together and melt multiple data frames with differing number of rows (and preserve each row, not throw some out). These are CDFs (not theoretical), and I'm only familiar with stat_ecdf as the only function in R to plot CDFs. I have updated the first example to use seed 8. All lines reach 1 at or near x=50, which is not possible as data exists beyond 50. Perhaps there is a better way to plot this versus using stat_ecdf? — user2030765, May 06 '21 at 21:29
The computation would go: sum of count at bin = 1 / total count, sum of counts at bins 1 and 2 / total count, sum of counts at bins 1, 2 and 3 / total count, .... This would then be plotted for some x-axis range (e.g., 25-50, as using an x-axis for the full spread of values is too large). — user2030765, May 06 '21 at 21:40
@user2030765 as I've shown in my post, all lines do not reach 1 at or near `x = 50` as you mentioned above. So `stat_ecdf` is plotting as expected. When you scale the x axis, the ECDF is calculated for values within that range. — LMc, May 06 '21 at 21:50
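
The computation described in the comments (cumulative count at each value divided by the total count, then restricted to an x range) can be sketched as follows. This is only an illustration, not code from the thread: the helper name `cum_ratio` and the long-format data frame `md` are assumptions made for the example, and the data is regenerated directly in long format rather than melted.

```r
library(ggplot2)

set.seed(8)

# Long format built directly: one row per observation, labelled by data set.
md <- rbind(
  data.frame(Dat = "a", Value = sample(25:300, 1000, replace = TRUE)),
  data.frame(Dat = "b", Value = sample(25:350, 950, replace = TRUE)),
  data.frame(Dat = "c", Value = sample(25:400, 965, replace = TRUE)),
  data.frame(Dat = "d", Value = sample(25:450, 970, replace = TRUE))
)

# cum_ratio() is a hypothetical helper: for each distinct value,
# cumulative count up to that value divided by the total count.
cum_ratio <- function(v) {
  counts <- table(v)  # counts per distinct value, in increasing order
  data.frame(Value = as.numeric(names(counts)),
             Ratio = as.numeric(cumsum(counts)) / length(v))
}

# Compute the ratios on the FULL data, separately for each data set.
cdf_full <- do.call(rbind, lapply(split(md, md$Dat), function(g) {
  data.frame(Dat = g$Dat[1], cum_ratio(g$Value))
}))

# Subset only after the ratios exist, so they stay relative to all the data.
cdf_sub <- subset(cdf_full, Value >= 25 & Value <= 50)

p <- ggplot(cdf_sub, aes(x = Value, y = Ratio, color = Dat)) +
  geom_step()
```

Printing `p` draws the step lines for the 25 to 50 range only. Because the division uses the full count of each data set, the lines stay well below 1 in that range.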

LMc · Accepted Answer · 2021-05-06T21:59:59.170

Your code and the following code produce the same plot (image not reproduced here), which is what I would expect:

library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(8)

dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))

df <- bind_rows(dat1, dat2, dat3, dat4, .id = "dat")

df1 <- df %>% 
  pivot_longer(cols = a:d, values_drop_na = TRUE)  # drop the NA cells left by bind_rows

ggplot(df1, aes(x = value, color = dat, linetype = dat)) + 
  stat_ecdf(aes(color = dat))

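If you want to set the limits without recalculating the ECDF (i.e., "zoom" in on the graph), use coord_cartesian, not scale_x_continuous. Limits set on a scale drop out-of-range values before the statistic is computed, whereas coord_cartesian only changes the visible region: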

ggplot(df1, aes(x = value, color = dat, linetype = dat)) +
  stat_ecdf() + 
  coord_cartesian(xlim = c(25, 50),
                  ylim = c(0, 0.1))
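
To see the difference numerically without plotting, here is a minimal base R sketch using `ecdf()` from the stats package. It assumes a single sample drawn like `dat1`; the variable names are made up for the example. Filtering the data first mimics what setting limits on the scale does, while evaluating the full-data ECDF at 50 corresponds to what the zoomed plot shows.

```r
set.seed(8)
x <- sample(25:300, 1000, replace = TRUE)

# ECDF computed on all the data, then evaluated at x = 50.
# This is the height of the line at the right edge of the zoomed plot.
full_at_50 <- ecdf(x)(50)

# ECDF computed only on the values inside [25, 50].
# Every remaining value is <= 50, so the ECDF is forced to 1 there.
x_sub <- x[x >= 25 & x <= 50]
sub_at_50 <- ecdf(x_sub)(50)

full_at_50  # close to 26/276 (about 0.09), up to sampling noise
sub_at_50   # 1
```

The full-data value is small because only 26 of the 276 possible values fall in the 25 to 50 range, which is why a ylim of about 0.1 suits the zoomed plot.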

I would expect this. However, I want to "zoom" in to the left side for x-axis between 25-50. I do not want to plot any values larger than this, as they become too large and make the plot difficult to view. Here, using seed 8 in your figure, I'd like the plot to go between 25 and 50, where at x=50 the lines max out at ~0.08 (respective of the dat lines). Using `scale_x_continuous` does not work, as the values at x=50 reach 1, which is not possible as they are in your figure ~0.8. — user2030765, May 06 '21 at 21:58
