Conditionally selecting last N values within a group by another column using R

Question

This question is similar to selecting the top N values within a group by column here.

However, I want to select the last N values by group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If count >3, I only want the last three entries but if it is less than 3, I only want the last entry.

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))

# Obtain count for each name
count <- df %>%
  group_by(Name) %>%
  summarise(Count = n_distinct(Value))

# Merge dataframe with count
merge(df, count, by=c("Name"))

# Delete the first entry for x and the first entry for z

# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))

you should include `library(dplyr)` and tag your question accordingly — C8H10N4O2, Jul 31 '17 at 17:43

Frank · Answer 1 · 2017-07-31T18:36:43.977

4

Another dplyrish way:

df %>% group_by(Name) %>% slice(tail(row_number(), 
  if (n_distinct(Value) < 3) 1 else 3
))

# A tibble: 7 x 2
# Groups:   Name [3]
    Name Value
  <fctr> <dbl>
1      x     2
2      x     3
3      x     4
4      y     5
5      y     6
6      y     7
7      z     9

The analogue in data.table is...

library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]

The closest thing in base R is...

with(df, {
  len = tapply(Value, Name, FUN = length)
  nv  = tapply(Value, Name, FUN = function(x) length(unique(x)))
  df[ sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})

... which is way more difficult to come up with than it should be.

edited Jul 31 '17 at 18:36

answered Jul 31 '17 at 18:21

Frank

66,179
8
96
180

@C8H10N4O2 It assumes the OP has it sorted as they wish already, since they just ask for "the last N values by group". – Frank Jul 31 '17 at 18:30
OK, then this is definitely the best solution. – C8H10N4O2 Jul 31 '17 at 18:36
Thanks :) Conditional on the need for sorting, I agree that yours is the way to go. – Frank Jul 31 '17 at 18:37

Steven Beaupré · Answer 2 · 2017-07-31T18:50:04.287

3

Another possibility:

library(tidyverse)

df %>%
  split(.$Name) %>%
  map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))

Which gives:

#  Name Value
#1    x     2
#2    x     3
#3    x     4
#4    y     5
#5    y     6
#6    y     7
#7    z     9

edited Jul 31 '17 at 18:50

answered Jul 31 '17 at 18:38

Steven Beaupré

21,343
7
57
77

2

This is starting to get esoteric... I like it though – C8H10N4O2 Jul 31 '17 at 18:39
1

Could use an `n_distinct` if OP's code is to be believed. – Frank Jul 31 '17 at 18:44
1

@Frank Oh, yes. I thought it was not an issue from OP's comment: *Also, the values are all unique so no need to worry about ties but I appreciate the thoroughness!* – Steven Beaupré Jul 31 '17 at 18:49

d.b · Accepted Answer · 2017-07-31T19:38:11.000

2

In base R, split the df by df$Name first. Then, for each subgroup, check number of rows and extract last 3 or last 1 row conditionally.

do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))

Or

do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
#    Name Value
#x.2    x     2
#x.3    x     3
#x.4    x     4
#y.5    y     5
#y.6    y     6
#y.7    y     7
#z      z     9

For three conditional values

do.call(rbind, lapply(split(df, df$Name), function(a)
      a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))

edited Jul 31 '17 at 19:38

answered Jul 31 '17 at 17:40

d.b

32,245
6
36
77

Thank you @d.b! Is there a way to extend this to split it into three different subgroups? I imagine it would have c(6,3,1) (if a count of 6 were my next break point) but I can't really figure out the NROW input – Anna Jul 31 '17 at 18:14
this looks good but for some reason, I'm not getting the expected output. If I have `df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z","a","a","a","a","a","a","a"), Value = c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7))` then my expected output is `data.frame(Name = c("x","x","x","y","y","y","z","a","a","a","a","a","a"), Value = c(2,3,4,5,6,7,9,2,3,4,5,6,7))` However, instead I get `data.frame(Name = c("a","a","a","x","x","x","y","y","y","z"), Value = c(5,6,7,2,3,4,5,6,7,9))` – Anna Jul 31 '17 at 18:27
1

@Anna, try `do.call(rbind, lapply(split(df, df$Name), function(a) a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))` – d.b Jul 31 '17 at 18:29

score 2 · Answer 4 · answered Jul 31 '17 at 17:53

2

If you're already using dplyr, the natural approach is:

library(dplyr)

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), 
                 Value = c(1,2,3,4,5,6,7,8,9))

df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value))) %>% 
  filter((Count>= 3 & Rank <= 3) | (Rank==1)) %>%
  select(-c(Count,Rank))

There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.

answered Jul 31 '17 at 17:53

C8H10N4O2

18,312
8
98
134

See `?ranking` for your ranking options -- your question didn't make it entirely clear how you would break ties. – C8H10N4O2 Jul 31 '17 at 17:56
Thank you @C8H10N4O2! How would I extend this to a filter by three different groups? For example, if I wanted to break by 6, 3, or 1 count, where I take the "floor" of the count. Also, the values are all unique so no need to worry about ties but I appreciate the thoroughness! – Anna Jul 31 '17 at 18:06
@Anna you just need to figure out how to write your selection as a logical statement and then put it inside `filter()`. Also, note that this answer will rank your data by Value, and Frank's just takes the last N rows by group. So if you are confident that your dataset is already sorted, you should accept Frank's answer. – C8H10N4O2 Jul 31 '17 at 18:35

Conditionally selecting last N values within a group by another column using R

4 Answers4

Linked