
I've been trying to plot a graph, but for some reason I keep having observations removed. I have a dataframe with 350 observations and 11 variables, but when I try to plot my graph, 140 of the observations are removed.

I started off by modifying the dataframe in order to plot against time over two consecutive days:

library(hms)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)

# Generate sample data
df <- data.frame(hour = hms(sample(0:59, replace = TRUE, 350),
                             sample(0:59, replace = TRUE, 350),
                             sample(c(0:5, 20:23), replace = TRUE, 350)),
                  count = floor(runif(350, min=0, max=20)))

df <- df %>%
  mutate(ai = count/10) %>% 
  mutate(graphing.date = if_else(
    hour > parse_hms("12:00:00"), as.Date("2022-02-05"), 
    as.Date("2022-02-06")), 
    graphing.datetime = as.POSIXct(paste(graphing.date, hour)))

And then I use the variables graphing.datetime and ai as my x and y variables respectively:

p <- ggplot(df, aes(x = graphing.datetime, y = ai)) + 
  geom_point() + 
  scale_x_datetime("Time",
                   limits = c(as.POSIXct("2022-02-05 20:00:00"),
                              as.POSIXct("2022-02-06 06:00:00")),
                   date_breaks = "1 hours", 
                   date_labels = "%H:%M") 
p

When I do this, I get the following message:

Warning message: Removed 140 rows containing missing values (geom_point).

What can I do to fix this? Is there anything wrong with my code that I need to fix?

  • 2
    Welcome to SO, Haddonchris031! Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including attempted code (you have this, but please be explicit about non-base packages), sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Mar 31 '22 at 20:24
  • 4
    It probably just means that there are dates outside the limits you set in the scale, which then are not displayed. In particular, the `oob` argument handles what happens to out-of-bounds values. The default is to set these to `NA`, which are then dropped. – teunbrand Mar 31 '22 at 20:24
  • 1
    Thank you @r2evans for your feedback! I wasn't entirely sure how to actually post on here so I really do appreciate the feedback. I made some edits, so I hope it's better now? – Haddonchris031 Mar 31 '22 at 21:04
  • Much better, thank you for the update. When using random data, it is often (but not always) useful to start with a known random seed (e.g., `set.seed(42)`) so that you and all of us are looking at the same random data, otherwise we will never be able to accurately reproduce your data and vice versa. In this case, it _should_ still produce the warning (albeit with a different number of removed rows), but it is possible that it won't. Set the seed in the code here and update the warning output. – r2evans Mar 31 '22 at 21:10

1 Answer


Your sample data is insufficient (for me) to trigger the warning, so I'll adapt it:

set.seed(42)
# ... your code ...
head(df)
#       hour count  ai graphing.date   graphing.datetime
# 1 00:58:48     5 0.5    2022-02-06 2022-02-06 00:58:48
# 2 23:48:36     7 0.7    2022-02-05 2022-02-05 23:48:36
# 3 02:29:00    18 1.8    2022-02-06 2022-02-06 02:29:00
# 4 21:58:24    13 1.3    2022-02-05 2022-02-05 21:58:24
# 5 02:20:09    14 1.4    2022-02-06 2022-02-06 02:20:09
# 6 01:43:35    14 1.4    2022-02-06 2022-02-06 01:43:35

From here, notice that the range of the observations' datetimes falls within the limits= you set in your plot p:

range(df$graphing.datetime)
# [1] "2022-02-05 20:00:25 EST" "2022-02-06 05:59:52 EST"

This means that if we draw your plot right now, it produces no warnings, and the plot shows all of the points.

p <- ggplot(df, aes(x = graphing.datetime, y = ai)) + 
  geom_point() + 
  scale_x_datetime("Time",
                   limits = c(as.POSIXct("2022-02-05 20:00:00"),
                              as.POSIXct("2022-02-06 06:00:00")),
                   date_breaks = "1 hours", 
                   date_labels = "%H:%M") 
p

ggplot2 with no warnings

However, let's tighten the limits= a bit, and we'll get the warnings:

p <- ggplot(df, aes(x = graphing.datetime, y = ai)) + 
  geom_point() + 
  scale_x_datetime("Time",
                   limits = c(as.POSIXct("2022-02-05 20:10:00"),
                              as.POSIXct("2022-02-06 05:50:00")),
                   date_breaks = "1 hours", 
                   date_labels = "%H:%M") 
p
# Warning: Removed 17 rows containing missing values (geom_point).
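
To see where a count like that comes from, one can count the rows that fall outside the limits directly. This is only a minimal sketch in base R on a small made-up data set; the names toy, when, lims, and outside are hypothetical and not part of the question's code:

```r
# Hypothetical toy data: five timestamps, the first and last outside the limits
toy <- data.frame(
  when = as.POSIXct(c("2022-02-05 20:05:00", "2022-02-05 22:00:00",
                      "2022-02-06 01:30:00", "2022-02-06 05:45:00",
                      "2022-02-06 05:55:00"), tz = "UTC"),
  ai = c(0.5, 0.7, 1.8, 1.3, 1.4)
)

# The same tightened limits as in the plot above (time zone fixed to UTC here)
lims <- as.POSIXct(c("2022-02-05 20:10:00", "2022-02-06 05:50:00"), tz = "UTC")

# TRUE for rows whose x value lies outside the scale limits; with the default
# out-of-bounds handling these become NA and are dropped by geom_point()
outside <- toy$when < lims[1] | toy$when > lims[2]

sum(outside)    # number of rows that would be reported as removed
toy[outside, ]  # the rows themselves
```

The same comparison applied to df$graphing.datetime with your own limits should account for the number in the warning, assuming there are no other missing values in x or y.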

This means (as @teunbrand identified) that the limits= you set are what cause data to be removed from the plot: values outside the limits are converted to NA and then dropped. Some thoughts:

  • Expand the limits= so that it includes all the data you expect. While you could assign it programmatically based on observed data, that seems less useful since it would be effectively the same as not setting limits= at all.

  • "Ignore" the warning: you are knowingly setting limits inside the range of the available data, so it is expected.

  • If you really want to silence this warning when printing/rendering it, you can do suppressWarnings(print(p)) instead of just p at the end. (Other warnings that might be useful will also be suppressed/hidden, so use this with care.)

  • Limit the boundaries using coord_cartesian instead of scale_*_datetime. This zooms the view without removing any data, so anything outside the boundaries is merely clipped from view (which can absolutely be required when working with lines/segments). However, when used indiscriminately, it has the side effect of not telling you if or how many data are outside the boundaries of your plot. (Hiding that message is the goal here, but one should be deliberate about using it.)

    p <- ggplot(df, aes(x = graphing.datetime, y = ai)) + 
      geom_point() + 
      scale_x_datetime("Time",
                       date_breaks = "1 hours", 
                       date_labels = "%H:%M") +
      coord_cartesian(xlim = c(as.POSIXct("2022-02-05 20:10:00"),
                               as.POSIXct("2022-02-06 05:50:00")))
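
Following up on @teunbrand's comment about the `oob` argument, another possibility is to keep the limits on the scale but change how out-of-bounds values are handled. The following is a minimal sketch, not a definitive recommendation; it assumes a ggplot2 version whose scale_x_datetime() accepts `oob` and a scales version that provides oob_keep(), and the names toy, when, lims, and p_keep are hypothetical:

```r
library(ggplot2)

# Hypothetical toy data: the first and last points lie outside the limits
toy <- data.frame(
  when = as.POSIXct(c("2022-02-05 20:05:00", "2022-02-05 22:00:00",
                      "2022-02-06 01:30:00", "2022-02-06 05:45:00",
                      "2022-02-06 05:55:00"), tz = "UTC"),
  ai = c(0.5, 0.7, 1.8, 1.3, 1.4)
)
lims <- as.POSIXct(c("2022-02-05 20:10:00", "2022-02-06 05:50:00"), tz = "UTC")

p_keep <- ggplot(toy, aes(x = when, y = ai)) +
  geom_point() +
  scale_x_datetime("Time",
                   limits = lims,
                   date_breaks = "1 hours",
                   date_labels = "%H:%M",
                   # keep out-of-bounds values as they are instead of
                   # converting them to NA (the default behaviour)
                   oob = scales::oob_keep)
p_keep
```

Because the out-of-bounds values are no longer turned into NA, this should not raise the "Removed ... rows" warning. As with coord_cartesian, the trade-off is that nothing reports how many points lie outside the limits.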
    
r2evans