
Problem

I have a very large data frame (almost 20,000 rows) in which observations were recorded approximately every 1-3 minutes. Unfortunately, I am unable to upload any authentic data to this post.

Aim

The aim is to reduce the size of the data frame by filtering the rows by date and time into 5-minute intervals for each month. The data spans over 4 years.

Over the last couple of days I have tried several R packages and functions, such as dplyr, aggregate(), and the tidyverse, to figure out how to do this, but to no avail; I just can't solve this conundrum.

Structure of the data frame

I have a data frame like this

ID    Date       Time    
 1 9/15/16 6:48:00 AM 
 2 9/15/16 6:54:00 AM 
 3 9/15/16 6:57:00 AM 
 4 9/15/16 6:59:00 AM 
 5 9/15/16 7:03:00 AM 
 6 9/15/16 7:05:00 AM 

I would like to convert this data into a data frame like the example below by calculating the number of minutes and seconds between each subsequent 'ID' row.

    ID    Date Start_Time    End_Time Minutes Seconds
     1 9/15/16 6:48:00 AM  6:54:00 AM     6.0  360.00
     2 9/15/16 6:54:00 AM  6:57:00 AM     3.0  180.00
     3 9/15/16 6:57:00 AM  6:59:00 AM     2.0  120.00
     4 9/15/16 6:59:00 AM  7:03:00 AM     4.0  240.00
     5 9/15/16 7:03:00 AM  7:05:00 AM     2.0  120.00
     6 9/15/16 7:05:00 AM  etc

Afterwards, I'd like to filter the data frame containing these new between-row calculations, by date and time, into 5-minute (300-second) intervals for each month, to reduce the size of the data frame.

The output should be something like this unless someone has a more efficient method.

    ID    Date Start_Time    End_Time Minutes Seconds
     1 9/15/16 6:48:00 AM  6:54:00 AM     6.0  360.00
     
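For the thinning step itself, one possible approach (a sketch only, assuming lubridate's `floor_date()` and dplyr's `slice_min()`; the `track` data frame below is a mock-up of the structure shown above, not real data) is to bin each timestamp into its 5-minute interval and keep only the first row per bin:

```r
library(dplyr)
library(lubridate)

# Mock-up of the structure shown above
track <- data.frame(
  ID   = 1:6,
  Date = rep("9/15/16", 6),
  Time = c("6:48:00 AM", "6:54:00 AM", "6:57:00 AM",
           "6:59:00 AM", "7:03:00 AM", "7:05:00 AM")
)

thinned <- track %>%
  mutate(Timestamp = mdy_hms(paste(Date, Time)),
         # label each row with the start of its 5-minute bin
         Interval  = floor_date(Timestamp, unit = "5 minutes")) %>%
  group_by(Interval) %>%
  slice_min(Timestamp, n = 1) %>%   # keep the earliest fix in each bin
  ungroup()
```

Because rows 6:57 and 6:59 fall in the same 5-minute bin, only the 6:57 row survives; the rest of the rows are retained.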

I appreciate your thoughts on this.

Many thanks in advance.

Progress

Many thanks in advance for this solution; it worked really well. Sorry for all these questions, as I am a novice with R. Could I please ask what the warning message below (ℹ 232 failed to parse) means, and what went wrong with the calculations of minutes and seconds for the rows in my data frame ('New_Track', see below)? The values in the 'Minutes' and 'Seconds' columns are exactly the same. In addition, for IDs 7 and 8, the new calculations show a difference of 950520 minutes and 950520 seconds, when the correct difference is approximately 2 minutes, or 120 seconds (see below).

Row 7

7     7  9/15/16  7:07:00 AM 2016-09-15 07:07:00 2016-09-26 07:09:00  950520M 0S  950520
8     8  9/26/16  7:09:00 AM 2016-09-26 07:09:00 2016-09-26 07:11:00     120M 0S     120

My data frame is called 'track' and its columns are: (1) ID; (2) Date; and (3) Time.

Code (with the warning it produced):

library(dplyr); library(lubridate)

New_Track <- data.frame(
                        stringsAsFactors = FALSE,
                        ID = track$ID,
                        Date = track$Date,
                        Time = track$Time
                        ) %>%
             mutate(Start_Time = mdy_hms(paste(Date, Time)),
                    End_Time = lead(Start_Time),
                    Minutes = minutes(End_Time - Start_Time),
                    Seconds = (End_Time - Start_Time) / dseconds(1))
Warning message:
Problem while computing `Start_Time = mdy_hms(paste(Date, Time))`.
ℹ  232 failed to parse. 

New Data frame Layout - 'New_Track'

  ID     Date        Time          Start_Time            End_Time     Minutes Seconds
1     1  9/15/16  6:48:00 AM 2016-09-15 06:48:00 2016-09-15 06:54:00     360M 0S     360
2     2  9/15/16  6:54:00 AM 2016-09-15 06:54:00 2016-09-15 06:57:00     180M 0S     180
3     3  9/15/16  6:57:00 AM 2016-09-15 06:57:00 2016-09-15 06:59:00     120M 0S     120
4     4  9/15/16  6:59:00 AM 2016-09-15 06:59:00 2016-09-15 07:03:00     240M 0S     240
5     5  9/15/16  7:03:00 AM 2016-09-15 07:03:00 2016-09-15 07:05:00     120M 0S     120
6     6  9/15/16  7:05:00 AM 2016-09-15 07:05:00 2016-09-15 07:07:00     120M 0S     120
7     7  9/15/16  7:07:00 AM 2016-09-15 07:07:00 2016-09-26 07:09:00  950520M 0S  950520
8     8  9/26/16  7:09:00 AM 2016-09-26 07:09:00 2016-09-26 07:11:00     120M 0S     120
9     9  9/26/16  7:11:00 AM 2016-09-26 07:11:00 2016-09-26 07:13:00     120M 0S     120 
Alice Hobbs
  • You can use something like: `cut(x, breaks="5 min")` to divide your dataset into 5-minute intervals, then summarize the intervals with the dplyr package. – Dave2e Mar 31 '22 at 02:59
  • Thanks, Dave2e. I have just researched the parameters for cut(). Would x be combined vectors: x = c(ID, Date, Time), then Breaks<-cut(x, breaks="5 min"). If you can expand on your answer, I would be really grateful. – Alice Hobbs Mar 31 '22 at 04:41
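To illustrate Dave2e's suggestion: `x` would be a single POSIXct date-time vector, not the combined `c(ID, Date, Time)` columns. A minimal sketch (the timestamps are a made-up subset of the structure above):

```r
library(lubridate)

# A single date-time vector, parsed from the Date/Time strings
times <- mdy_hms(c("9/15/16 6:48:00 AM", "9/15/16 6:54:00 AM",
                   "9/15/16 6:57:00 AM", "9/15/16 7:03:00 AM"))

# cut() labels each timestamp with the 5-minute interval it falls in;
# the result is a factor that can be used as a grouping variable
bins <- cut(times, breaks = "5 min")
```

`bins` can then be passed to `dplyr::group_by()` to summarize each interval.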

1 Answer

library(dplyr); library(lubridate)
data.frame(
  stringsAsFactors = FALSE,
                ID = c(1L, 2L, 3L, 4L, 5L, 6L),
              Date = c("9/15/16","9/15/16","9/15/16",
                       "9/15/16","9/15/16","9/15/16"),
              Time = c("6:48:00","6:54:00","6:57:00",
                       "6:59:00","7:03:00","7:05:00")
) %>%
  mutate(Start_Time = mdy_hms(paste(Date, Time)),
         End_Time = lead(Start_Time),
         Minutes = minutes(End_Time-Start_Time),
         Seconds = (End_Time-Start_Time) / dseconds(1))

Result

  ID    Date    Time          Start_Time            End_Time Minutes Seconds
1  1 9/15/16 6:48:00 2016-09-15 06:48:00 2016-09-15 06:54:00   6M 0S     360
2  2 9/15/16 6:54:00 2016-09-15 06:54:00 2016-09-15 06:57:00   3M 0S     180
3  3 9/15/16 6:57:00 2016-09-15 06:57:00 2016-09-15 06:59:00   2M 0S     120
4  4 9/15/16 6:59:00 2016-09-15 06:59:00 2016-09-15 07:03:00   4M 0S     240
5  5 9/15/16 7:03:00 2016-09-15 07:03:00 2016-09-15 07:05:00   2M 0S     120
6  6 9/15/16 7:05:00 2016-09-15 07:05:00                <NA>    <NA>      NA
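A likely explanation for the `360M` oddity in the question's follow-up run: `End_Time - Start_Time` is a difftime whose default unit depends on the magnitude of the gaps, and `minutes()` reuses that bare number verbatim. A unit-safe variant (a sketch, not part of the accepted answer, using an explicit `difftime()` with `units = "secs"`; the two-row `df` is illustrative only):

```r
library(dplyr)
library(lubridate)

df <- data.frame(Date = c("9/15/16", "9/15/16"),
                 Time = c("6:48:00 AM", "6:54:00 AM"))

out <- df %>%
  mutate(Start_Time = mdy_hms(paste(Date, Time)),
         End_Time   = lead(Start_Time),
         # force the unit so the result cannot silently switch between
         # seconds and minutes depending on the size of the gaps
         Seconds    = as.numeric(difftime(End_Time, Start_Time,
                                          units = "secs")),
         Minutes    = Seconds / 60)
```

Here the 6:48-to-6:54 gap always comes out as 360 seconds / 6 minutes, regardless of how large the other gaps in the data are.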

It's still unclear to me what you're looking for here:

I'd like to filter the data frame containing the new calculations between the subsequent rows of 'IDs' by date and time into 5-minute or 300.0-second time intervals per month to reduce the size of the data frame.

Jon Spring
  • Thank you so much for your help. This is a boat track line that I will be plotting on a map in QGIS. Coordinates (longitude and latitude), date, and time of the boat were recorded every 1-3 minutes. To make the analysis less intensive, I would like to plot point data every five minutes instead of every 1-3 minutes; I thought this method would decrease the size of my data frame. If you have any better suggestions, I would happily receive them. – Alice Hobbs Mar 31 '22 at 03:50
  • I inputted my code from your solution above and I was wondering if I could please query a warning message and the calculation for IDs 7 and 8. Many thanks in advance for helping with my understanding – Alice Hobbs Mar 31 '22 at 04:14
  • 1) It will be much easier to help if you share sample data in a reproducible way. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. `dput()` is often the quickest way. 2) It sounds like 232 of your 20k rows have inconsistent date or time formatting; you can use `resultant_data_frame %>% filter(is.na(Start_Time))` to see what those rows look like. 3) Your row 8 is 11 days later, not 2 minutes later. 4) I think your "thin the data" question should be a separate question. You might filter to keep only rows where enough time or distance has changed. – Jon Spring Mar 31 '22 at 15:43
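Jon Spring's point (2) can be sketched as follows (the two-row data frame is a made-up illustration of an inconsistent date format, not the asker's data):

```r
library(dplyr)
library(lubridate)

# One row uses m/d/y, the other y-m-d; mdy_hms() cannot parse the second
demo <- data.frame(Date = c("9/15/16", "2016-09-15"),
                   Time = c("6:48:00 AM", "6:54:00 AM")) %>%
  mutate(Start_Time = mdy_hms(paste(Date, Time)))  # warns: 1 failed to parse

# Show the offending rows so their formatting can be inspected and fixed
demo %>% filter(is.na(Start_Time))
```

Each "failed to parse" in the warning corresponds to one such `NA` row.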