
Problem:

I want to reset a (1,2) sequence whenever a condition is met (the subject changes).
I have `for` and `if` loops that do this but, unsurprisingly, that approach is very slow. Any suggestions (e.g., involving the apply family) for a more efficient method?

Current:

  subj odd_even
    a         
    a         
    a         
    b         
    b         
    b         
    b         
    c         
    c         
    c         

Goal:

  subj odd_even
    a      1   
    a      2   
    a      1   
    b      1   
    b      2   
    b      1   
    b      2   
    c      1   
    c      2   
    c      1   

df = data.frame( subj = c("a","a","a","b","b","b","b", "c","c","c"), odd_even = "" )
lnNoam
  • This is a sort of clunky approach but you could separate out the df by subject, create `odd_even` for each df, and then `rbind` everything back together – ila Jun 26 '15 at 21:04

3 Answers


I like the sequence function for this:

df$odd_even <- 2L - sequence(table(df$subj)) %% 2L
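To see why the one-liner works, here is a minimal sketch that splits it into steps on the question's data. The intermediate names `counts` and `counter` are only illustrative, and the sketch assumes the rows for each subject are contiguous and the subjects appear in alphabetical order.

```r
# Example data from the question (stringsAsFactors = FALSE keeps subj as character)
df <- data.frame(subj = c("a","a","a","b","b","b","b","c","c","c"),
                 stringsAsFactors = FALSE)

counts  <- table(df$subj)     # rows per subject: a = 3, b = 4, c = 3
counter <- sequence(counts)   # 1 2 3 1 2 3 4 1 2 3, restarting for each subject

# counter %% 2L is 1 for odd positions and 0 for even ones,
# so 2L - counter %% 2L maps odd positions to 1 and even positions to 2
df$odd_even <- 2L - counter %% 2L
```

The result matches the Goal table in the question: 1 2 1 for `a`, 1 2 1 2 for `b`, 1 2 1 for `c`.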

data.table is another option:

library(data.table)
setDT(df)
df[, odd_evenDT := 2L - seq_along(.I) %% 2L, by = subj]

Benchmarks:

set.seed(42)
df <- data.frame(subj = sort(sample(as.character(1:1e4), 1e5, TRUE)))
DT <- data.table(df)

library(microbenchmark)
library(dplyr)  # needed for the jeremy expression (%>%, group_by, mutate)
microbenchmark(roland1 = 2L - sequence(table(df$subj)) %% 2L,
           roland2 = DT[,2L - seq_along(.I) %% 2L, by = subj],
           roland3 = 2L - sequence(rle(as.integer(df$subj))$lengths) %% 2L,
           jeremy = df %>% group_by(subj) %>%
             mutate(odd_even = 2 - (row_number() %% 2)),
           frank = 2L - ave(as.integer(df$s),df$s,FUN=seq_along) %% 2L, 
           flick = ave(seq_along(df$subj), df$subj, FUN=function(x) rep(c(1,2), length.out=length(x))),
           times = 10, unit = "relative")

# Unit: relative
#     expr      min       lq      mean   median        uq      max neval
#  roland1 5.820459 5.754497 5.0368686 5.404110 4.0853039 4.847161    10
#  roland2 1.110919 1.057952 0.9840653 1.037428 0.7939004 1.176258    10
#  roland3 1.000000 1.000000 1.0000000 1.000000 1.0000000 1.000000    10
#   jeremy 5.024087 4.941366 4.3491117 4.635534 3.5144515 4.277011    10
#    frank 2.036816 1.944603 1.7809168 1.831937 1.6459597 1.607283    10
#    flick 3.655127 3.621457 3.2453089 3.473188 2.7717947 3.198285    10
Frank
Roland
  • That `sequence` thing is quite elegant. I guess that `table` will sort by `subj` alphabetically, so it is required not only that `subj` records be contiguous, but also sorted between `subj`s. – Frank Jun 26 '15 at 21:58
  • Correct. Although, you could always do `df$subj <- ordered(df$subj, levels = unique(df$subj))` first if that's an issue. – Roland Jun 26 '15 at 22:02
  • Ah, hadn't seen that function before. I was thinking another alternative would be `rle(as.integer(df$subj))$lengths` instead of `table`. Oh actually, I just tested it (taking out all the `<-` and `:=`) and `roland3 = 2L - sequence(rle(as.integer(df$subj))$lengths) %% 2L` is faster than `roland2` on my computer – Frank Jun 26 '15 at 22:05
  • @Frank Feel free to edit the benchmarks. Then I won't have to run them. – Roland Jun 26 '15 at 22:13
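The sorting caveat raised in these comments can be checked on a small made-up vector. This is a sketch under assumptions: the data is invented for illustration, and `rle` is applied to a character vector directly rather than to `as.integer()` of a factor as in the comment.

```r
# Contiguous subjects that are NOT in alphabetical order (illustrative data)
subj <- c("b", "b", "b", "a", "a")

# table() orders its counts alphabetically (a = 2, b = 3), so the counter
# restarts after row 2 instead of after row 3
via_table <- unname(2L - sequence(table(subj)) %% 2L)

# rle() reports run lengths in order of appearance (3, then 2)
via_rle <- 2L - sequence(rle(subj)$lengths) %% 2L

# Fixing the level order first, as suggested above, makes table() follow
# the order in which subjects appear
subj_ord    <- ordered(subj, levels = unique(subj))
via_ordered <- unname(2L - sequence(table(subj_ord)) %% 2L)
```

Here `via_table` comes out as 1 2 1 2 1, which is wrong for this data, while `via_rle` and `via_ordered` both give 1 2 1 1 2.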

Here's another clunky approach:

df$odd_even <- 2L - ave(as.integer(df$subj), df$subj, FUN = seq_along) %% 2L

`ave` builds a counter within each group. Taking that counter modulo 2 is the odd-vs-even test.
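As a rough illustration of the within-group counter, the sketch below runs the two steps separately on the question's subjects. Note one swap: it passes `seq_along(subj)` as the first argument instead of `as.integer()` of the grouping column, which avoids a coercion warning when `subj` is a character vector. The values in that first argument are never used, because `FUN = seq_along` only looks at each group's length.

```r
subj <- c("a","a","a","b","b","b","b","c","c","c")

# ave() splits the first argument by subj, applies FUN to each piece,
# and puts the results back in the original row order
counter <- ave(seq_along(subj), subj, FUN = seq_along)  # 1 2 3 1 2 3 4 1 2 3

# Odd counter values map to 1, even ones to 2
odd_even <- 2L - counter %% 2L
```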

Frank
  • I used `ave()` in a slightly different way, but it is a useful function in this case: `ave(seq_along(df$subj), df$subj, FUN=function(x) rep(c(1,2), length.out=length(x)))` – MrFlick Jun 26 '15 at 21:21
  • @MrFlick Oh, that's a nice approach; I forgot about `length.out`. I don't like how `ave` demands a numeric first argument and explicit `FUN=`; makes everything a pain. It always takes three tries for me to get it working. – Frank Jun 26 '15 at 21:23

What is the desired behavior if a subj reoccurs later in the dataframe?
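The question matters because grouping by value and grouping by run disagree once a subject comes back. A small base R sketch on invented data (the vector and variable names are illustrative, and base R stands in for any group-by-value approach):

```r
# "a" reappears after "b" (illustrative data)
subj <- c("a", "a", "a", "b", "a")

# Grouping by value: the last "a" continues the earlier count (4th "a")
by_value <- 2L - ave(seq_along(subj), subj, FUN = seq_along) %% 2L

# Grouping by run: every change of subject restarts the count
by_run <- 2L - sequence(rle(subj)$lengths) %% 2L
```

`by_value` gives 1 2 1 1 2, whereas `by_run` gives 1 2 1 1 1. Which one is right depends on whether "subject changes" means a new value or a new run.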

If it won't happen, here's a dplyr method:

library(dplyr)

df %>% group_by(subj) %>%
       mutate(odd_even = 2 - (row_number() %% 2))
jeremycg