R: sum rows from column A until conditioned value in column B

Question

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but then only those first rows where the values of 'X_POSITION" are increasing. I only want to sum the first round within a trial where X increases. The first rows of a simplified dataframe:

  TRIAL_INDEX DURATION X_POSITION 
1           1      204      314.5 
2           1      172      471.6             
3           1      186      570.4           
4           1      670      539.5          
5           1      186      503.6          
6           2      134      306.8           
7           2      182      503.3            
8           2      806      555.7           
9           2      323      490.0

So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).

The desired output should look something like:

  TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1           1      204      314.5             562
2           1      172      471.6             562
3           1      186      570.4             562
4           1      670      539.5             562
5           1      186      503.6             562
6           2      134      306.8            1122
7           2      182      503.3            1122
8           2      806      555.7            1122
9           2      323      490.0            1122

I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe. However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.

FirstPassRT = dat %>% 
          group_by(TRIAL_INDEX) %>%
          filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>% 
          summarise(FIRST_PASS_TIME=sum(DURATION))

Any help and suggestions are greatly appreciated!

Your `X_POSITION` is not numeric, so how do expect R to know if it's increasing or not? I would guess it uses the underlying integer representation (if those factors) and throws warnings. See [here](http://stackoverflow.com/questions/15236440/as-numeric-with-comma-decimal-separators) for instance. — David Arenburg, May 16 '16 at 15:27
Thank David, good point. I changed the X_POSITION values to numeric ones, little beginner's mistake with reading in the data. Still, do not seem to get my desired output... — Saskia, May 16 '16 at 15:52

score 2 · Accepted Answer · answered May 16 '16 at 16:45

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1

# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
#   TRIAL_INDEX DURATION X_POSITION time
#1:           1      204      314.5  562
#2:           1      172      471.6  562
#3:           1      186      570.4  562
#4:           1      670      539.5  562
#5:           1      186      503.6  562
#6:           2      134      306.8 1122
#7:           2      182      503.3 1122
#8:           2      806      555.7 1122
#9:           2      323      490.0 1122

Eddi, this is great, thank you so much! I've been trying to write for loops, which are extremely slow, and this is such a fast and short way. — Saskia, May 16 '16 at 21:00

score 1 · Answer 2 · answered May 16 '16 at 15:54

Here is something you can try with dplyr package:

library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%  
        mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>% 
        mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>% 
        select(-IncLogic)

Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]

  TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
        (int)    (int)      (dbl)           (int)
1           1      204      314.5             562
2           1      172      471.6             562
3           1      186      570.4             562
4           1      670      539.5             562
5           1      186      503.6             562
6           2      134      306.8            1122
7           2      182      503.3            1122
8           2      806      555.7            1122
9           2      323      490.0            1122

If I'm reading OP correctly this is not what they want. Change row 5 value of position to 600 and run this. — eddi, May 16 '16 at 16:33

score 0 · Answer 3 · answered May 16 '16 at 16:03

If you want to summarize it down to one row per trial you can use summarize like this:

library(dplyr)

df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
                 DURATION = c(204,172,186,670, 186,134,182,806, 323),
                 X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))

res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(X_POSITION))
res

#Source: local data frame [2 x 2]
#
#  TRIAL_INDEX FIRST_PASS_TIME
#        (dbl)           (dbl)
#1           1          1356.5
#2           2          1365.8

R: sum rows from column A until conditioned value in column B

3 Answers3