I am trying to create a scatterplot with date/time on the x-axis and salinity on the y-axis. However, there are some date/time points which do not have a salinity value due to equipment failure, but I still need these portions of time to show on my graph to help explain the ecological patterns I am looking at. Can anyone advise on how to keep these missing sections shown on the graph?

Below is my current code for the data and the plot, which does not show the missing values.

Edit: My data has explicit missing values. Readings that were removed due to logger errors are listed as 'NA' (see photo). Unfortunately, I have thousands of data points collected half-hourly, so it is difficult to show all of the data.

Screenshot of data showing 'NA' values

library(readr)    # read_csv()
library(ggplot2)  # ggplot()
OY1_AllTimes <- read_csv("~/Documents/TAMUG_Thesis/Rollover_Pass_Data/Logger/RP_LoggerData_OY1_AllTimes.csv")
summary(OY1_AllTimes)

OY1_AllTimes$Date_time<-paste(OY1_AllTimes$Date, OY1_AllTimes$Time)
summary(OY1_AllTimes$Date_time)

date_time_OY1_AllTimes<-as.POSIXct(OY1_AllTimes$Date_time, format="%m/%d/%Y %H:%M")
date_time_OY1_AllTimes
date_time2_OY1_AllTimes<-as.factor(date_time_OY1_AllTimes)
date_time2_OY1_AllTimes
summary(OY1_AllTimes)

Summary of OY1_AllTimes

p_OY1_AllTimes <- ggplot(data = OY1_AllTimes, aes(x=date_time2_OY1_AllTimes, y=Salinity)) + geom_point() + theme_classic()+
  scale_x_discrete("Date", breaks=c("0019-10-04 09:30:00", "0019-11-01 05:00:00", "0019-12-01 00:00:00", "0020-01-01 00:00:00", "0020-02-01 00:00:00",
                                    "0020-03-01 00:00:00","0020-04-01 00:00:00", "0020-05-01 00:00:00", "0020-06-01 00:00:00"),
                   labels=c("10/2019", "11/2019", "12/2019", "1/2020", "2/2020", "3/2020", "4/2020", "5/2020", "6/2020"))+ylab("Salinity")+ggtitle("OY1")
p_OY1_AllTimes

Scatterplot of OY1 without missing values

Essentially I would like to see the above scatterplot with gaps representing the periods without salinity data so that the date/time scale is continuous.

Subsample of data:

# A tibble: 50 x 5
   Site  Date    Time   Salinity Date_time       
   <chr> <chr>   <time>    <dbl> <chr>           
 1 OY1   10/4/19 09:30    NA     10/4/19 09:30:00
 2 OY1   10/4/19 10:00    NA     10/4/19 10:00:00
 3 OY1   10/4/19 10:30     0.891 10/4/19 10:30:00
 4 OY1   10/4/19 11:00     0.961 10/4/19 11:00:00
 5 OY1   10/4/19 11:30     1.02  10/4/19 11:30:00
 6 OY1   10/4/19 12:00     1.10  10/4/19 12:00:00
 7 OY1   10/4/19 12:30     1.19  10/4/19 12:30:00
 8 OY1   10/4/19 13:00     1.27  10/4/19 13:00:00
 9 OY1   10/4/19 13:30     1.33  10/4/19 13:30:00
10 OY1   10/4/19 14:00     1.42  10/4/19 14:00:00
# … with 40 more rows
  • I would think you'd want to use the `date_time_OY1_AllTimes` variable instead of the converted factor version, and drop the `scale_x_discrete`. Then you should get a continuous date axis scaled based on the underlying timestamps instead of just stacked in sequence. – Jon Spring Aug 20 '21 at 18:26
  • @JonSpring unfortunately this still produces a graph that removes all my 'NA' data. I need to show these as gaps in the time-series. – Ashley McDonald Aug 27 '21 at 16:05
  • Can you explain more what you mean by "show these as gaps"? Does that mean you want text on the axis for each missing point? I had thought my answer below was "showing the gaps" by having a time axis with points missing from a section, but maybe I'm not understanding yet. – Jon Spring Aug 27 '21 at 16:39
  • @JonSpring So I want what you have shown above, but when I tried to do it, R is still removing my rows that have missing values and giving me other errors. I am successfully running your code for the first graph but am getting the warning message: removed 3831 rows containing missing values (geom_point). For your second graph I am getting an error for an unused argument (data_labels = "%b\n'%y"). For your third graph I am getting the error 'breaks' and 'labels' must have the same length. – Ashley McDonald Aug 27 '21 at 21:31
  • The warning you saw is because your data included NA's in at least one column you're using in the ggplot for 3831 rows of your data. That might be fine if that's what you expect. The second error might arise if your `Date_time` column is not datetime data (typically POSIXct). Maybe it's character or factor data? (What is `str(OY1_AllTimes$Date_time)`)? In any case, it will be much easier to help if you can include a sample of data *in the form of code* in your question, as described here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Jon Spring Aug 27 '21 at 22:47
  • @JonSpring I have added a subsample of my data above (I think; let me know if that isn't what you were looking for). I do expect there to be NAs, I just don't want them to be excluded from my analysis. It looks like my OY1_AllTimes$Date_time is showing up as a character, but I am having some difficulty sorting out how to convert it without losing the necessary date/time format and information. – Ashley McDonald Aug 30 '21 at 19:29
  • Try `OY1_AllTimes$Date_time = lubridate::mdy_hms(OY1_AllTimes$Date_time)` (your pasted strings include a time, so plain `mdy` would not parse them), or in your code you probably need `%y` instead of `%Y` if your years are just the last two digits. – Jon Spring Aug 30 '21 at 20:53

2 Answers

Here's my attempt at demonstrating this with some reproducible code that we can all run.

Here's some arbitrary fake data. What's important is that it has a big gap in the timestamps, since I removed a few hundred rows from 100:399. At this point timestamp is stored as datetime data, in the "POSIXct" variety, the most typical, and the same as your date_time_OY1_AllTimes variable.

library(ggplot2)
set.seed(42)
my_fake_data <- data.frame(timestamp = as.POSIXct("2021-01-01 00:00") + cumsum(runif(1000, 0, 6E4)),
                           reading = cumsum(rnorm(1000)))
# Drop rows 100:399 to leave a large gap in the timestamps
my_fake_data <- my_fake_data[c(1:99, 400:1000), ]

The typical thing in ggplot2 is to plot using that POSIXct value. You'll see the gap. ggplot2 maps the timestamp to the x axis, and picks the default labels for us.

ggplot(my_fake_data, aes(timestamp, reading)) +
  geom_point() 

If we want monthly labels, we can specify that and the format we want to see:

ggplot(my_fake_data, aes(timestamp, reading)) +
  geom_point() +
  scale_x_datetime(date_breaks = "month", 
                   date_labels = "%b\n'%y", minor_breaks = NULL)

In your example, the timestamps have been converted to factors, which preserves their sequence, but it removes them from their context in time, so the gaps have disappeared. Here I've added discrete labels manually, but they no longer have an explicit relationship in time to my data points. I can make them say whatever I want, and they'll be wrong unless I put in some work to align them manually.

ggplot(my_fake_data, aes(as.factor(timestamp), reading)) +
  geom_point() +
  scale_x_discrete(breaks = as.factor(my_fake_data[1+100*0:7,1]),
                   labels = format(
                     seq.Date(as.Date("2021-01-01"), 
                              as.Date("2021-08-01"), by = "month"), "%b %Y"))

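Applied to data shaped like the subsample in the question, the same idea might look like the sketch below. The `oy1` data frame is a hypothetical stand-in typed in from the ten rows shown there (it is not the real file), and the `%H:%M:%S` format and `tz = "UTC"` are assumptions based on how the pasted `Date_time` strings appear. The two points to note are the lowercase `%y` for two-digit years and keeping the POSIXct column on the x axis instead of converting it to a factor.

```r
library(ggplot2)

# Hypothetical stand-in for OY1_AllTimes, copied from the rows shown in the question
oy1 <- data.frame(
  Date     = rep("10/4/19", 10),
  Time     = c("09:30:00", "10:00:00", "10:30:00", "11:00:00", "11:30:00",
               "12:00:00", "12:30:00", "13:00:00", "13:30:00", "14:00:00"),
  Salinity = c(NA, NA, 0.891, 0.961, 1.02, 1.10, 1.19, 1.27, 1.33, 1.42)
)

# %y (lowercase) reads two-digit years, so "19" becomes 2019 rather than year 0019
oy1$Date_time <- as.POSIXct(paste(oy1$Date, oy1$Time),
                            format = "%m/%d/%y %H:%M:%S", tz = "UTC")

# Keep the POSIXct column on x; do not convert it to a factor
p <- ggplot(oy1, aes(x = Date_time, y = Salinity)) +
  geom_point(na.rm = TRUE) +   # na.rm = TRUE only silences the "removed rows" warning
  scale_x_datetime(date_labels = "%H:%M") +
  theme_classic() +
  ggtitle("OY1")
p
```

The rows with `NA` salinity stay in the data frame; no point is drawn for them, and because the x axis is a real time scale, those periods show up as empty stretches.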

Jon Spring

It is quite hard to tell what your data really looks like, so I am assuming you have an implicit missing data problem.

That would mean you have a data.frame/time series with missing observations, but the missing values are not explicitly given as NAs. Instead, those rows are simply left out.

A time series with NAs would look like this:

1.1.2021 14:00
1.1.2021 15:00
1.1.2021 16:00
1.1.2021 17:00
1.1.2021 NA
1.1.2021 19:00

I guess your problem looks like this:

1.1.2021 14:00
1.1.2021 15:00
1.1.2021 16:00
1.1.2021 17:00
1.1.2021 19:00

So the difference is that there is no NA value for the 18:00 time step. But of course you do know there is a missing value (that's why it is called an implicit missing value).

Assuming you have a regularly spaced time series (meaning values measured at regular intervals, e.g. 1h), you can use the tsibble package to convert the implicit missing values to explicit missing values, so that the NAs appear in the series.

Here is an easy example (as I don't have your data):

library("tsibble")

# Read in your data as tsibble
data_example <- tsibble(
  year = c(2016, 2017, 2018, 2019, 2021, 2022),
  measure = sample(1:10, size = 6),
  index = year
)

# Take a look at the data
data_example

# Use the fill_gaps function of tsibble
data_na <- fill_gaps(data_example, .full = TRUE)

# You can see now, the implicit missing year 2020 is now added as NA  
data_na

You can of course also do this for all kinds of regularly spaced time series data (15 seconds, minute, hour, month, ...). You just have to define the time step when creating your tsibble object.
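For half-hourly logger data, the same gap-filling idea can also be sketched in base R, without tsibble. This is not the `fill_gaps` approach itself, just an equivalent under stated assumptions: the `obs` data frame, its column names, and the timestamps are made up for illustration, and the series is assumed to fall exactly on a 30-minute grid.

```r
# Hypothetical half-hourly readings with the 11:00 and 11:30 rows absent
obs <- data.frame(
  timestamp = as.POSIXct(c("2019-10-04 10:00", "2019-10-04 10:30",
                           "2019-10-04 12:00", "2019-10-04 12:30"), tz = "UTC"),
  salinity  = c(0.89, 0.96, 1.10, 1.19)
)

# Every half hour between the first and last reading
filled <- data.frame(
  timestamp = seq(min(obs$timestamp), max(obs$timestamp), by = "30 min")
)

# Look up each grid timestamp in the observations; unmatched times get NA
idx <- match(as.numeric(filled$timestamp), as.numeric(obs$timestamp))
filled$salinity <- obs$salinity[idx]

filled
```

The result has one row per half hour, with `NA` salinity where no reading existed, so it can be plotted against `timestamp` in the same way as the tsibble output.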

Plotting is easy now:

library("ggplot2")
ggplot(data = data_na) + geom_point( aes(year, measure))

This plots the series with no point for 2020, which is what you wanted. If you want to put more focus on the missing data, you can also use the imputeTS package.

library("imputeTS")
ggplot_na_distribution(data_na)

This is only a small example time series; the same plot can be produced for larger time series.
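If imputeTS is not an option, the missing stretches can also be shaded directly in ggplot2. The following is a minimal sketch, assuming a regular series whose gaps are already explicit NAs. The `dat` data frame and its values are invented for illustration, and shading with `geom_rect` is only one of several ways to mark the gaps.

```r
library(ggplot2)

# Hypothetical half-hourly series with two explicit NA stretches
dat <- data.frame(
  timestamp = seq(as.POSIXct("2019-10-04 00:00", tz = "UTC"),
                  by = "30 min", length.out = 48)
)
dat$salinity <- seq(1, 3, length.out = 48)
dat$salinity[c(10:15, 30:33)] <- NA

# Run-length encode the NA indicator to find where each missing stretch starts and ends
runs   <- rle(is.na(dat$salinity))
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
gaps   <- data.frame(
  xmin = dat$timestamp[starts[runs$values]],
  xmax = dat$timestamp[ends[runs$values]]
)

# Grey bands mark the missing periods; points are drawn for observed values only
p <- ggplot(dat, aes(timestamp, salinity)) +
  geom_rect(data = gaps, inherit.aes = FALSE,
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf),
            fill = "grey80") +
  geom_point(na.rm = TRUE) +
  theme_classic()
p
```

Each band runs from the first to the last missing timestamp of a stretch, so a stretch consisting of a single NA would have zero width and would need padding (for example by half a time step on each side) to be visible.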

Steffen Moritz
  • Sorry for the lack of clarity, I actually have explicit missing values which show as 'NA' in my R data (I will attach a photo above). However, when I try to plot these R still removes my rows containing missing values. Would tsibble still work in this case? – Ashley McDonald Aug 27 '21 at 16:01