R sum consecutive duplicate rows and remove all but first

Question

I am stuck with a probably simple question - how to sum consecutive duplicate rows and remove all but first row. And, if there is a NA in between two duplicates (such as 2,na,2) , also sum them and remove all but the first entry. So far so good, here is my sample data

ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))

sample output

Now I want to 1.) sum the "time" column of consecutive ia's - i.e., sum the time if the number 1 occurs twice or more right after each other, in my case here sum first and second row of column time to 4.5+2.4.

2.) if there is a NA in between two numbers (ia column) which are the same (i.e., ia = 2, NA, 2), then also sum all of those times.

3.) keep only first occurence of the ia, and delete the rest.

In the end, I would want to have something like this:

 a
       ia time
    1   1  6.9
    3   2  6.3
    6   1  20.4
    10  2  7.3
    11  1  2.3
    12  2  4.3

I found this for summing, but it does not take into account the consecutive factor

aggregate(time~ia,data=a,FUN=sum)

and I found this for deleting

a[cumsum(rle(as.numeric(a[,1]))$lengths),]

although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs.

if I have a pattern of 1-NA-2 then the NA should NOT be counted with either of them, in this case the NA row should be removed.

What if you have a pattern of `1-NA-2`? Should NA be counted with 1, with 2, or with neither? Also, are there ever more than one NA in a row? — David Robinson, Sep 15 '15 at 14:21
if I have a pattern of 1-NA-2 then the NA should NOT be counted with either of them, in this case the NA should be removed. Yes, there also may be more then one NA in a row. — LaNeu, Sep 15 '15 at 14:27
To clarify: when I say more than one NA in a row, I mean more than one consecutive NA in sequence. For instance, is it possible that there is `1 NA NA 1` in sequence? And in that case, should the NAs be counted as 1? — David Robinson, Sep 15 '15 at 14:56
it is possible that there is say `1 NA NA NA 1` in sequence. In this case, the NAs should be counted as 1. — LaNeu, Sep 15 '15 at 15:06
Great. My answer below (as well as [Colonel Beauvel's](http://stackoverflow.com/a/32589049/712603)) handle this — David Robinson, Sep 15 '15 at 15:07
great thanks so much - i'll try it with my data now (this was very quick!!) and will mark as solved if I can get it to work! — LaNeu, Sep 15 '15 at 15:12

Colonel Beauvel · Answer 1 · 2015-09-17T14:35:51.213

5

With data.table (as RHertel suggested for na.locf):

library(data.table)
library(zoo)

setDT(a)[na.locf(ia, fromLast=T)==na.locf(ia), sum(time), cumsum(c(T,!!diff(na.locf(ia))))]
#   id   V1
#1:  1  6.9
#2:  2  6.3
#3:  3 20.4
#4:  4  7.3
#5:  5  2.3
#6:  6  4.3

edited Sep 17 '15 at 14:35

answered Sep 15 '15 at 14:45

Colonel Beauvel

30,423
11
47
87

1

This carries values forward even if they are followed by a different value, which violates the behavior described [here](http://stackoverflow.com/questions/32588433/r-sum-consecutive-duplicate-rows-and-remove-all-but-first/32588960#comment53030951_32588433). For example, try this with `a <- data.frame(ia = c(1, NA, 2), time = 1)` – David Robinson Sep 15 '15 at 14:48

score 3 · Accepted Answer · edited May 23 '17 at 12:25

You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:

library(dplyr)
library(zoo)

a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   2  7.3
#> 10  1  2.3
#> 11  2  4.3

Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:

result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#> 
#>   group    ia  time
#>   (int) (dbl) (dbl)
#> 1     1     1   6.9
#> 2     2     2   6.3
#> 3     3     1  15.7
#> 4     4     2   7.3
#> 5     5     1   2.3
#> 6     6     2   4.3

If you want to get rid of the group column, use the additional lines:

result %>%
  ungroup() %>%
  select(-group)

Hi, thanks again for your help. I was wondering how to handle extra columns - say I also have the column z in the dataframe z<-c(1,1,1,1,1,1,1,1,1,1,1,1) which should not be summed or anything, only the rows should be deleted as above (i.e., along with the consecutive ia's) ..Do you know how to handle this? — LaNeu, Sep 16 '15 at 13:14
When i run this example i do not get the corresponding output. I only get one value: the summarised time. What am i doing wrong? — team17, Apr 23 '17 at 13:53
@team17 That's odd, I can't reproduce that. What version of dplyr are you using? (And if it's a dev version from GitHub, when did you install it?) — David Robinson, Apr 23 '17 at 15:14

Pierre L · Answer 3 · 2015-09-15T17:39:43.287

2

nas <- which(is.na(df$ia))
add.index <- sapply(nas, function(x) {logi <- which(as.logical(df$ia))
  aft <- logi[logi > x][1]
  fore <- tail(logi[logi< x], 1)
  if(df$ia[aft] == df$ia[fore]) aft else NA})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df),]

First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.

Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.

df$grps <- cumsum(c(F, !df$ia[-length(df$ia)] == df$ia[-1]))+1
aggregate(time ~ grps, df, sum)
#   grps time
# 1    1  6.9
# 2    2  6.3
# 3    3 20.4
# 4    4  7.3
# 5    5  2.3
# 6    6  4.3

This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.

edited Sep 15 '15 at 17:39

answered Sep 15 '15 at 14:33

Pierre L

28,203
6
47
69

This solution does not work if there is more than one NA in a row. – David Robinson Sep 15 '15 at 14:42
@DavidRobinson I tested your function with two NA values in row 4. For value `2` NA is returned. As per OP's desired output, that would not happen. With my solution `4.8` is returned. – Pierre L Sep 15 '15 at 14:49
My miscommunication about the word "row". I don't mean more than two NAs in one row of a data frame; I mean more than two consecutive NA's in sequence in the `ia` column- for instance, if both `a$ia[4]` and `a$ia[5]` are NA. (That is, [the idiom "in a row"](http://idioms.thefreedictionary.com/in+a+row)) – David Robinson Sep 15 '15 at 14:50
thanks a lot for your help!! However, if I have `ia<-c(1,1,2,NA,2,1,NA,NA,1,2,1,2)` and then the same time, it does not corretly calculate the sum for the second set of consecutive NAs - in this particular example, it calculates the `1+NA`, removes the second `NA` and then goes to the next number. – LaNeu Sep 15 '15 at 15:27
It has been updated. I understand an answer has been chosen, but I'm curious to know if this works for you. – Pierre L Sep 15 '15 at 15:38

R sum consecutive duplicate rows and remove all but first

3 Answers3

Linked