
I have a data frame of ~20,000 rows with two columns: Time and Steps. There are stretches of duplicate/repeated Steps readings that are a few seconds to a few minutes apart, so unique() does not catch them. !duplicated() removes duplicates across the whole data frame, so that there is only one row with X steps, even if a legitimate record days later also has X steps. I want to remove these stretches of repeats, leaving one record per stretch; it doesn't matter which one is kept. Also, the number of repeats is not consistent: it may be 8, it may be 2.

>View(df) 
          Time                Steps
    13737 2017-08-07 09:53:14 16
    13738 2017-08-07 09:53:14 16
    13739 2017-08-07 09:57:51 61
    13740 2017-08-07 09:59:05 61
    13741 2017-08-07 09:59:07 61
    13742 2017-08-07 09:59:11 61
    13743 2017-08-07 09:59:14 61
    13744 2017-08-07 09:59:14 61
    13745 2017-08-07 10:03:51 29
    13746 2017-08-07 10:05:05 29
    13747 2017-08-07 10:05:07 29
    13748 2017-08-07 10:05:11 29
    13749 2017-08-07 10:05:14 29
    13750 2017-08-07 10:05:14 29
    13751 2017-08-07 10:09:51 54
    13752 2017-08-07 10:11:05 54
    13753 2017-08-07 10:11:07 54

Desired output:

13737 2017-08-07 09:53:14 16
13739 2017-08-07 09:57:51 61
13745 2017-08-07 10:03:51 29
13751 2017-08-07 10:09:51 54

Another solution I've pondered is just paring the dataset down to one row every 10 minutes or so; if that is the better approach, I could use some guidance there too.
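For reference, this is roughly what I had in mind for the thinning approach (just a sketch, assuming lubridate's floor_date is a reasonable way to bucket the timestamps; df_thinned and window are placeholder names):

library(dplyr)
library(lubridate)

# Sketch: keep only the first reading in each 10-minute window
df_thinned <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  group_by(window = floor_date(Time, "10 minutes")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-window)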

  • Questions of this kind are pretty common. A base R method would be `dat[c(TRUE, diff(dat$Step) != 0), ]`. – lmo Aug 17 '17 at 20:28
  • Not entirely clear, but seems like a dupe of [Select only the first rows for each unique value of a column in R](https://stackoverflow.com/questions/13279582/select-only-the-first-rows-for-each-unique-value-of-a-column-in-r/13279904#13279904). That is, specify a desired variable in `!duplicated()` (here, "Steps"?) – Henrik Aug 17 '17 at 20:30
  • I have seen tons of questions about removing duplicated rows, but nothing that addresses my need to only remove duplicates in a stretch of repeats while leaving values further down in place – jstewartmitchel Aug 17 '17 at 20:45

1 Answer


We can use dplyr. First arrange the data frame by Steps and Time, then group by Steps, and finally slice out the first row of each group.

library(dplyr)
library(lubridate)

df2 <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  arrange(Steps, Time) %>%
  group_by(Steps) %>%
  slice(1)

Or, equivalently, with filter():

df2 <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  arrange(Steps, Time) %>%
  group_by(Steps) %>%
  filter(row_number() == 1)

Or we can use data.table:

library(data.table)
dt <- as.data.table(df)
dt2 <- dt[, .SD[1], by = Steps][, c("Time", "Steps")]
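Note that grouping by Steps alone keeps only one row per Steps value for the entire data set; if the same value can legitimately reappear much later (as the question mentions), only consecutive repeats should be collapsed. A sketch along the lines of lmo's base R comment, translated to dplyr (it assumes the rows are, or can be, sorted by Time; df3 is a placeholder name):

df3 <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  arrange(Time) %>%
  # keep a row only when Steps differs from the immediately preceding row
  filter(is.na(lag(Steps)) | Steps != lag(Steps))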

Data Preparation

df <- read.table(text = "          Time                Steps
    13737 '2017-08-07 09:53:14' 16
    13738 '2017-08-07 09:53:14' 16
    13739 '2017-08-07 09:57:51' 61
    13740 '2017-08-07 09:59:05' 61
    13741 '2017-08-07 09:59:07' 61
    13742 '2017-08-07 09:59:11' 61
    13743 '2017-08-07 09:59:14' 61
    13744 '2017-08-07 09:59:14' 61
    13745 '2017-08-07 10:03:51' 29
    13746 '2017-08-07 10:05:05' 29
    13747 '2017-08-07 10:05:07' 29
    13748 '2017-08-07 10:05:11' 29
    13749 '2017-08-07 10:05:14' 29
    13750 '2017-08-07 10:05:14' 29
    13751 '2017-08-07 10:09:51' 54
    13752 '2017-08-07 10:11:05' 54
    13753 '2017-08-07 10:11:07' 54",
                 header = TRUE, stringsAsFactors = FALSE)
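As a quick sanity check on this sample, the base R one-liner from the comments (with the column name spelled Steps) should reproduce the desired output; result is just an illustrative name:

result <- df[c(TRUE, diff(df$Steps) != 0), ]
result$Steps   # 16 61 29 54, matching the desired output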