I'm having a rough time isolating n-rows before and after a flag by group

I found an answer elsewhere that sort of worked, but it was thrown off by groups with fewer rows than the window. For example, if the window was 6 rows but a group only had five observations, the query would start including irrelevant observations from the prior group.

Here's some dummy data to reproduce.

x <- c("", "", "", "1", "", "","", "", "", "", "", "1","", "", "", "", "1", "")
y <- c("2", "6", "4", "4", "7", "9","1", "15", "7", "4", "5", "8","6", "1", "2", "4", "6", "16")
z <- c("a", "a", "a", "a", "a", "a","a", "b", "b", "b", "b", "b","b", "b", "c", "c", "c", "c")

a <- as.data.frame(cbind(x, y, z))

  x  y z
1     2 a
2     6 a
3     4 a
4  1  4 a
5     7 a
6     9 a
7     1 a
8    15 b
9     7 b
10    4 b
11    5 b
12 1  8 b
13    6 b
14    1 b
15    2 c
16    4 c
17 1  6 c
18   16 c

Ideally I'd like the result to look something like this:

  x  y z
1     6 a
2     4 a
3  1  4 a
4     7 a
5     9 a
6     1 a
7     4 b
8     5 b
9  1  8 b
10    6 b
11    1 b
12    2 c
13    4 c
14 1  6 c
15   16 c

1 Answer

a[zoo::rollapply(a$x, 5, function(z) "1" %in% z, partial = TRUE),]
#    x  y z
# 2     6 a
# 3     4 a
# 4  1  4 a
# 5     7 a
# 6     9 a
# 10    4 b
# 11    5 b
# 12 1  8 b
# 13    6 b
# 14    1 b
# 15    2 c
# 16    4 c
# 17 1  6 c
# 18   16 c

zoo::rollapply operates on "windows" of values. Here the window width is five, which means it looks at five values and returns a single value; then it shifts by one (four of the same values, plus one new one) and returns another single value; and so on.

Because I specified partial=TRUE (one way to keep the output the same length as the input; fill=NA is another), the number of values in a window near either end of the vector can be smaller than the window width (5).
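The effect at the ends of the vector is easier to see on a shorter input. The following is a small sketch, assuming the zoo package is installed; the vector flag and the helper has_flag are made up for illustration.

```r
library(zoo)

# Made-up example vector: a single "1" flag in position 4 of 8
flag <- c("", "", "", "1", "", "", "", "")

# Illustrative helper: TRUE if the window contains a flag
has_flag <- function(w) "1" %in% w

# partial = TRUE: windows at the ends are truncated, so every position gets a value
with_partial <- rollapply(flag, 5, has_flag, partial = TRUE)

# fill = NA: positions without a complete window are padded with NA instead
with_fill <- rollapply(flag, 5, has_flag, fill = NA)

# Both results have the same length as the input
length(with_partial) == length(flag)
length(with_fill) == length(flag)
```

With partial=TRUE the end positions get a real TRUE/FALSE based on a shorter window, whereas with fill=NA they become NA, which matters if the result is used directly to subset rows.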

The point is that with a centered window of five, if any of the values is a "1", then the row at the center of the window is within 2 rows of a "1" and should be retained.

An important property of the window is its alignment (the align argument), which defaults to "center". It determines which position in the window each result is assigned to.
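As a quick illustration of alignment, here is a minimal sketch on a made-up numeric vector, assuming zoo is installed. The same three-value sums are computed each time; only the position they are assigned to changes.

```r
library(zoo)

v <- 1:5  # made-up numeric vector

# Result is placed at the middle row of each window
centered <- rollapply(v, 3, sum, fill = NA, align = "center")
# NA  6  9 12 NA

# Result is placed at the first row of each window
left <- rollapply(v, 3, sum, fill = NA, align = "left")
#  6  9 12 NA NA

# Result is placed at the last row of each window
right <- rollapply(v, 3, sum, fill = NA, align = "right")
# NA NA  6  9 12
```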

In this case, the windows look like:

#  [1] ""  ""  ""  "1" ""  ""  ""  ""  ""  ""  ""  "1" ""  ""  ""  ""  "1" "" 
1:     nn-------' (partial window)
2:     ----yy--------' (partial window)
3:     `-------yy-------'  there is a "1" in this set of five, so a true ("yy")
4:         `-------yy-------'
5:             `-------yy-------'
6:                 `-------yy-------'
7:                     `-------nn-------' no "1", so a false
... etc
#  [1] ""  ""  ""  "1" ""  ""  ""  ""  ""  ""  ""  "1" ""  ""  ""  ""  "1" "" 

You can see in the first seven windows that the first is false ("nn" in my nomenclature, since there is no "1" close enough), the next five are true ("yy"), and the seventh is false again since its window does not contain a "1".
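Note that the window above runs over the whole column, so near a group boundary it can pick up a flag that belongs to a neighbouring group. One possible way to keep the selection inside each group, and to allow a different number of rows before and after the flag, is sketched below in base R. The helper near_flag and its arguments before and after are hypothetical names introduced here for illustration; they are not part of zoo, and the 2-before/3-after choice is an assumption taken from the desired output in the question.

```r
x <- c("", "", "", "1", "", "", "", "", "", "", "", "1", "", "", "", "", "1", "")
y <- c("2", "6", "4", "4", "7", "9", "1", "15", "7", "4", "5", "8", "6", "1", "2", "4", "6", "16")
z <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c")
a <- data.frame(x, y, z)

# Hypothetical helper: for one group's flag vector, return TRUE for rows that
# lie from `before` rows ahead of a "1" to `after` rows past it
near_flag <- function(flag, before = 2, after = 3) {
  hits <- which(flag == "1")
  idx <- unlist(lapply(hits, function(i) seq(i - before, i + after)))
  # Indices outside the group (e.g. 0 or beyond its length) simply never match
  seq_along(flag) %in% idx
}

# Apply the helper separately to each group, then restore the original row order
keep <- unsplit(lapply(split(a$x, a$z), near_flag), a$z)

res <- a[keep, ]
res  # keeps original rows 2-7 (a), 10-14 (b) and 15-18 (c)
```

Because the indices are computed per group, a group with fewer rows than the window just returns fewer rows rather than borrowing from the next group.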

r2evans
  • This is really helpful! The partial = true makes a lot of sense. – mccinthenyc Feb 23 '21 at 14:57
  • My one question is related to center > let's say I have an even number for the width like 6. Right now it says that "center" is the 4th observation, but I'd prefer if it was the 3rd observation. Is there any way to do that? I'm trying to read the documentation but can't find much. – mccinthenyc Feb 23 '21 at 14:58
  • NVM @r2evans found the answer here: https://stackoverflow.com/a/32235049/4846798 – mccinthenyc Feb 23 '21 at 15:02
  • There are two alternatives, not just one, for making the output the same length as the input: use partial=TRUE or fill=NA. – G. Grothendieck Feb 23 '21 at 15:44
  • Hrmmm...yeah sorry @r2evans it doesn't look like this is working now. I looked further down into my dataset and it still looks like the function is "reaching" into another group to get rows to fill the window. – mccinthenyc Feb 23 '21 at 17:33
  • When I set `partial = FALSE` I get even worse results... – mccinthenyc Feb 23 '21 at 17:34
  • You can't use `partial=FALSE` when you need the output to be the same length as the input (as when used within `data.frame` columns). – r2evans Feb 23 '21 at 17:38
  • What if I'm okay with that? When I set `partial = FALSE` it still doesn't provide results that I would anticipate. – mccinthenyc Feb 23 '21 at 17:51
  • If you're okay with that, then that's fine, just (a) don't use a `data.frame`, or (b) don't complain when your data is corrupted due to R's recycling rules. I can't comment on how or why it doesn't work, I have no idea what your real data looks like, what the errors/warnings are, or anything other than what is included in the OP. Saying *"still doesn't provide results that I would anticipate"* is fine as well as unsubstantiated. I'm not saying you're wrong, I'm saying *"with what I've been given, I can offer no further advice"*. – r2evans Feb 23 '21 at 18:22
  • Sorry, I really appreciate all your help on this. Your function is working the way you described and it appears to be doing the same on my end. I think the issue is that I need to "extract" exactly the row of the flag, two prior to the flag, and then three after the flag within each group. Your function says "is there a flag within 6 of me" and while that actually did the trick for most of my data it's not working for cases where a flag is close to the beginning/end of another group. – mccinthenyc Feb 23 '21 at 18:40