I have a data set whose X values are integers from 1 to several thousand, and I want to plot the mean Y and a measure of dispersion around that mean. The problem is that some X values are missing. When I use geom_line and geom_ribbon, the line and ribbon are drawn continuously across the missing values, and I cannot find a way to leave blanks where there is no data.

Here is a mock-up reproducible example.

data.1 <- read.csv(text = "
Treatment, X, Y_value
A,1,120.5
B,1,123.6
C,1,100.4
A,2,120.9
B,2,123.9
C,2,101.0
A,3,122.3
B,3,126.6
C,3,102.3
A,6,124.8
B,6,128.0
C,6,105.5
A,7,129.5
B,7,129.4
C,7,108.9
A,8,132.9
B,8,130.6
C,8,113.9
A,9,137.6
B,9,136.0
C,9,115.3
A,10,138.4
B,10,139.6
C,10,118.9
A,11,143.9
B,11,145.9
C,11,126.6
")

library(dplyr)    # provides %>%, group_by() and summarise()
library(ggplot2)

data.1 <- data.1 %>% group_by(X) %>% summarise(mean.y = mean(Y_value),
                                                  sd.y = sd(Y_value))

ggplot(data.1, aes(X, mean.y)) +
        geom_line(color="red") +
        geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
        scale_x_continuous(limits=c(0,11), breaks = 0:11) +
        theme_bw() +
        theme(panel.grid.minor = element_blank(),
              panel.grid.major = element_blank())

Here is the output I am getting:

[Image: plot in which the line and ribbon run continuously across the missing X values 4 and 5]

And this is what I would like to get:

[Image: the same plot with a blank gap between X = 3 and X = 6]

Any hint on how to accomplish this would be really appreciated.

Thanks

Giuseppe Petri
  • Does this answer your question? [How to plot a line graph with discontinuity data?](https://stackoverflow.com/questions/22207246/how-to-plot-a-line-graph-with-discontinuity-data) or [Can you make geom_ribbon leave a gap for missing values?](https://stackoverflow.com/questions/35454277/can-you-make-geom-ribbon-leave-a-gap-for-missing-values) – hamagust Jun 10 '20 at 20:52
  • @hamagust, thanks for the reply. I have checked both posts but they didn't solve my problem. Regarding https://stackoverflow.com/questions/22207246/how-to-plot-a-line-graph-with-discontinuity-data, I'd need to identify each discontinuity beforehand. I have several thousand X values and there is no pattern in the missing data, so it would need to be done automatically. I couldn't understand how the https://stackoverflow.com/questions/35454277/can-you-make-geom-ribbon-leave-a-gap-for-missing-values solution works. – Giuseppe Petri Jun 10 '20 at 21:05

1 Answer

You can add a grouping column that marks the X values on either side of a discontinuity, so that geom_line and geom_ribbon draw each group separately. In the first example below I've hard-coded the cutoff (X < 5), but in general you can build the grouping programmatically if you have a criterion for where the discontinuities should be.

For example:

ggplot(data.1, aes(X, mean.y, group=X<5)) +
  geom_line(color="red") +
  geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
  scale_x_continuous(limits=c(0,11), breaks = 0:12) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())

Or, if our criterion is to have a discontinuity whenever the distance between x-values is greater than one:

data.1 %>% 
  mutate(g = c(0, cumsum(diff(X) > 1))) %>%
    ggplot(aes(X, mean.y, group=g)) +
      geom_line(color="red") +
      geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
      scale_x_continuous(limits=c(0,11), breaks = 0:12) +
      theme_bw() +
      theme(panel.grid.minor = element_blank(),
            panel.grid.major = element_blank())

Either way, here's the resulting plot:

[Image: resulting plot, with the line and ribbon broken between X = 3 and X = 6]

Here's some additional explanation to answer the question in the comment regarding how the mutate step creates the grouping column: We want to create a grouping variable that separates X values before and after a discontinuity. In the code above, we do that with a combination of the diff and cumsum functions.

diff calculates lagged differences. For example:

diff(data.1$X)
[1] 1 1 3 1 1 1 1 1

Note that one of the differences (the one between 3 and 6) is 3. Now let's add a logical condition:

diff(data.1$X) > 1
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

So now we have a vector of logical values where TRUE marks differences greater than one. cumsum will treat TRUE as equal to 1 and FALSE as equal to zero. The value of the cumulative sum will increment by one each time we encounter a TRUE, and will stay constant when we encounter a FALSE.

cumsum(diff(data.1$X) > 1)
[1] 0 0 1 1 1 1 1 1

Okay, now we have two groups, marking the X values before and after the discontinuity (if there are multiple discontinuities, we'll get a new group for each one). But we're not quite done.
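To illustrate the multiple-discontinuity case, here is a small sketch using a made-up vector (x below is hypothetical, not the question's data) that contains two gaps:

```r
# Hypothetical X values with two gaps: 3 -> 6 and 7 -> 10
x <- c(1, 2, 3, 6, 7, 10, 11)

# TRUE wherever the step to the next value is larger than 1
jumps <- diff(x) > 1

# Running count of the jumps seen so far; each jump starts a new group
g_tail <- cumsum(jumps)
print(g_tail)
# [1] 0 0 1 1 2 2
```

The counter goes up once per gap, so three runs of consecutive values give three distinct group ids.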

Note that diff takes a vector of length n and returns a vector of length n-1. This is simply because there are only n-1 lagged differences between n values. Thus, we add a leading zero to get a vector that's the same length as the input data:

c(0, cumsum(diff(data.1$X) > 1))
[1] 0 0 0 1 1 1 1 1 1
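
If you need this in several places, the steps above could be wrapped in a small helper. This is only a sketch: the name gap_group and the max_step argument are made up for illustration, and the function assumes X is already sorted in increasing order (which is the case after group_by(X) %>% summarise(...)).

```r
# Hypothetical helper (not part of the original answer): returns one group id
# per element of x, starting a new group whenever the step between consecutive
# values exceeds max_step. Assumes x is sorted in increasing order.
gap_group <- function(x, max_step = 1) {
  if (length(x) == 0) return(numeric(0))   # nothing to group
  stopifnot(!is.unsorted(x))               # fail early on unsorted input
  c(0, cumsum(diff(x) > max_step))         # leading 0 restores length n
}

x <- c(1, 2, 3, 6, 7, 8, 9, 10, 11)   # X values from the question
g <- gap_group(x)
print(g)
# [1] 0 0 0 1 1 1 1 1 1

# The group ids can also be used to list each continuous run of X
print(split(x, g))
```

With such a helper, the mutate step would read mutate(g = gap_group(X)), and a larger max_step would tolerate small gaps without breaking the line.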
eipi10
  • Thanks so much for the reply. Great solution; it works perfectly. I don't completely understand the code, but the end result is great. In ```mutate(g = c(0, cumsum(diff(X) > 1)))``` you are creating a new variable g that begins with zero. Is ```diff(X)``` calculating the difference between each X value and the next one? In any case, thanks again. – Giuseppe Petri Jun 10 '20 at 21:24
  • I'd like to upvote your answer 10 times if possible. Thanks so much for the detailed explanation. – Giuseppe Petri Jun 10 '20 at 21:47