How to change a variable at the nth occurrence of a value in another variable?

Question

There is a data.table

library(data.table)
car <- data.table(no = 1:100, turn = sample(1:5,100,replace = TRUE), 
              dis = sample(1:10,100,replace = TRUE))

I want to change "dis" to -1, at the nth occurrence of turn == 3, say the third time that "turn" is 3.

I can select the third row of turn == 3:

car[turn == 3, .SD[3]]

However, I don't manage to update "dis" at this row:

car[turn == 3, .SD[3]][, dis := -1]

A related Q&A: Conditionally replacing column values with data.table.

It is a matter of how to locate that row. Here is an alternative way similar to Henrik' scomment: `car[cumsum(turn == 3) == 3 & (turn == 3), dis := -1]` — mt1022, Aug 28 '17 at 08:52
correct. here's a similar: `car[(x <- turn == 3) & cumsum(x) == 3, dis := -1]` — talat, Aug 28 '17 at 08:56

Henrik · Accepted Answer · 2017-08-28T20:59:48.300

Some alternatives. Use rowid or cumsum to create a counter of rows within groups. Add the counter to your condition in i.

I use a slightly smaller toy data set, just to make it easier to track the changes:

d <- data.table(x = 1:3, y = 1:12)

d[rowid(x) == 3 & x == 3, y := -1]

# @mt1022
d[cumsum(x == 3) == 3 & (x == 3), y := -1]

# @docendo discimus
d[(ix <- x == 3) & cumsum(ix) == 3, y := -1]

Although OP didn't mention speed as an issue, I was still curious to time the different approaches on a larger vector. Unsurprisingly, @Frank's method is the fastest, especially so when the number of unique values to search among increases:

frank << docendo < henrik < mt022

microbenchmark(henrik = d[rowid(x) == 3 & x == 3, y := -1],
               mt1022 = d[cumsum(x == 3) == 3 & (x == 3), y := -1],
               docendo = d[(ix <- x == 3) & cumsum(ix) == 3, y := -1],
               frank = d[d[x == 3, which = TRUE][3], y := -1], unit = "relative")

d <- data.table(x = sample(1:3, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
#  henrik 4.417303 4.369407 4.133514 4.319839 4.329658 1.260394   100  b 
#  mt1022 5.461961 5.285562 5.174559 5.186404 5.239738 1.608712   100   c
# docendo 3.572646 3.624369 3.788678 3.589705 3.576637 1.733272   100  b 
#   frank 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100 a 

d <- data.table(x = sample(1:30, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
#  henrik 22.64881 19.54375 18.81963 18.91335 19.78559 5.507692   100  bc
#  mt1022 24.58258 21.17535 19.84417 20.96256 22.76020 3.625263   100   c
# docendo 19.40044 16.75912 16.23321 16.47953 18.06264 4.234100   100  b 
#   frank  1.00000  1.00000  1.00000  1.00000  1.00000 1.000000   100 a

d <- data.table(x = sample(1:300, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
#    expr      min       lq     mean   median       uq       max neval cld
#  henrik 31.81237 32.51122 28.79490 30.35766 28.63560  8.236282   100  b 
#  mt1022 34.71984 35.45341 33.20405 33.57394 31.50914 21.556367   100   c
# docendo 27.99046 28.15855 26.56954 26.60644 25.20044  7.847163   100  b 
#   frank  1.00000  1.00000  1.00000  1.00000  1.00000  1.000000   100 a

# Unit: milliseconds
#    expr       min        lq      mean    median       uq        max neval cld
#  henrik 60.655582 76.455531 83.061266 77.632036 78.57818 203.224042   100   c
#  mt1022 66.701182 84.133034 87.967300 84.937201 85.72464 201.167914   100   c
# docendo 52.938545 67.214360 71.558130 68.003891 68.51897 184.178346   100  b 
#   frank  1.977821  2.494039  2.629852  2.663577  2.76089   3.613905   100 a

Thanks to edit the question.I can study how to ask next time. — Gauss.Y, Aug 28 '17 at 09:17
@docendo discimus I included your nice suggestion in my answer. Please feel free to remove and post separate answer if you want. — Henrik, Aug 28 '17 at 09:33
@Frank ...`which` I didn't think of...Much better! Please post as answer so I can upvote it! — Henrik, Aug 28 '17 at 14:28
Ah, your ping didn't connect / show up in my inbox for some reason. Thanks for adding the benchmark. I guess rowid can be implemented very efficiently. It's essentially the same as `sequence(.Ns)` jumbled, where `.Ns` are the sizes of each group; so it doesn't need to do any computation (like `cumsum` does by taking the running sum). — Frank, Aug 28 '17 at 20:39
@Frank OK, yeah, makes sense. Now I just have to figure out how to save those `74.96846` milliseconds somewhere else. Cheers. — Henrik, Aug 28 '17 at 20:54

score 4 · Answer 2 · edited Aug 28 '17 at 14:46

Here's another way:

car[car[turn == 3, which = TRUE][3], dis := -1 ]

Comment. For an operation like this, you might want to turn verbose on, so you can see if any change was made. For example, if we look for the 111th 3....

car[car[turn == 3, which = TRUE][111], dis := -1, verbose = TRUE ]
# Detected that j uses these columns: dis 
# Assigning to 0 row subset of 100 rows

It shows that 0 rows were edited.

See also Using .I to return row numbers with data.table package on the use of which = TRUE.

How to change a variable at the nth occurrence of a value in another variable?

2 Answers2