10

I have a data table of this form (2000000+ rows, 1000+groups):

set.seed(1)    
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      a
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      a
13:  3      a
14:  3      a
15:  3      b

I want to, within each ID group, replace the entire sequence of character "a", that precedes the character "b", and I want to replace them with "b". So the condition is that if "a" or a sequence of "a"s appear before "b", replace all the "a"s. (actually, in my real table, it's when "b" is preceded by "a","x", or"y", preceding character should be replaced, but I should be able to generalize)

In the example above,the value of "a" in row 3 should be replaced (easy to do with (shift) in data.table), as well as all the "a"s in rows 12-14 (not sure how to do). So, the desired output is this:

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b

What comes to my mind is looping from the last index, but I am not exactly sure how to do that with if I have multiple groupings (say, ID and DATE), and anyway, this doesn't seem to be the fastest dt solution.

Henrik
  • 65,555
  • 14
  • 143
  • 159
Djpengo
  • 379
  • 2
  • 14

4 Answers4

5

Here's another data.table approach:

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
  • create a new column "x" with the run length ids per value grouped by id
  • join on itself while modifying the run length ids (x) to be the preceeding value and values to be "a" (the specific value you want to change), then update values with "b"
  • delete column x afterwards

The result is:

dt
#     id values
#  1:  1      a
#  2:  1      c
#  3:  1      b
#  4:  1      b
#  5:  1      a
#  6:  2      c
#  7:  2      c
#  8:  2      b
#  9:  2      b
# 10:  2      c
# 11:  3      c
# 12:  3      b
# 13:  3      b
# 14:  3      b
# 15:  3      b

And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
talat
  • 68,970
  • 21
  • 126
  • 157
4

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.

(1) Create a variable 'v2' which is NA when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'v' with 'v2'.

# 1
dt[values != "a" , v2 := values]

# 2
d1[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]

# 3
dt[values == "a" & v2 == "b", values := v2]

# clean-up
dt[ , v2 := NULL]

Currently, nafill only works with numeric variables, hence replace step in chunk # 2 (modified from @chinsoon12 in the issue nafill, setnafill for character, factor and other types).

The NA replacement code may be slightly shortened by using zoo::nalocf:

dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]

However, note that na.locf is slower.


When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE)), it turns out that this alternative actually is faster than the others.

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • Interesting solution, thanks! I really like it as rle might not be obvious to everyone reading your code. It can be made shorter by using zoo:na.locf (ri indices column in your example is only there because nafill doesn't work with characters I think). See: `dt[values != "a" , v2 := values]` ; `dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = "id"] ` ; `dt[values == "a" & v2 == "b", values := v2]` – Djpengo Jun 11 '20 at 10:32
  • Thanks a lot for your feedback @Djpengo. I tried `na.locf` before posting my answer, but it was considerably slower so I decided to exclude it (sorry, I should have mentioned that in the post). I might put it back. – Henrik Jun 11 '20 at 10:40
  • I guess it's good if it's in there, as it depends what you are looking for, simplicity or speed. With DT it makes sense that one is indeed looking for the speed though – Djpengo Jun 11 '20 at 12:18
2

This is not pretty but I think this is what you are after:

dt[, .N, by = .(id, values = paste0(values, rleid(values)))
   ][, values := sub("[0-9]+", "", values)
     ][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
       ][, .SD[rep(seq_len(.N), N)]
         ][, !"N"]

    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b
s_baldur
  • 29,441
  • 4
  • 36
  • 69
1

You can use rle().

Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".

dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt

#     id var new
#  1:  1   a   a
#  2:  1   c   c
#  3:  1   a   b
#  4:  1   b   b
#  5:  1   a   a
#  6:  2   c   c
#  7:  2   c   c
#  8:  2   b   b
#  9:  2   b   b
# 10:  2   c   c
# 11:  3   c   c
# 12:  3   a   b
# 13:  3   a   b
# 14:  3   a   b
# 15:  3   b   b
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
  • Thanks! Looks good, could you maybe help me understand how is lengths element from rle() used here, been trying to understand it for some time... – Djpengo Jun 09 '20 at 17:08
  • @Djpengo For example, you can run `x <- c("a", "a", "b", "b", "b") ; rle(x)`. It returns a list of 2 vectors, one is `values` and the other is `lengths`. `lengths` means the counts of elements in `values`. If you run `with(rle(x), rep(values, lengths))`, the output will be the same as the original `x` because elements in `values` are repeated `lengths` times. – Darren Tsai Jun 09 '20 at 17:26
  • 1
    Thanks, a very neat solution! For anyone reading this, dt[, rle(var), by = id] basically creates a table showing sequences of strings without repetitions, with lengths column showing how many times each string is repeated. This is much simpler case, and everything else follows from there – Djpengo Jun 09 '20 at 17:31
  • This is really neat as it follows the stated condition exactly and I am 100% sure that it should work as expected in a more complex table I am dealing with – Djpengo Jun 09 '20 at 17:32
  • I'll actually use dt$new <- dt[, rle(var), by = "id"][, rep(ifelse(values == "a" & shift(values, type= "lead") == "b", "b", values), lengths)] as I think that that sequence of commands is easiest to understand (important if you revisit your code after some time and forget how you came up with the solution :)) – Djpengo Jun 09 '20 at 17:45
  • 1
    @Djpengo good work. If you use `shift()`, remember to add `fill = ""`. Otherwise, missing values will appear if the final element of `dt$var` is `"a"`. You can set `dt$var[15] <- "a"` and run your code again. – Darren Tsai Jun 09 '20 at 17:54