0

This question is similar to selecting the top N values within a group by column here.

However, I want to select the last N values by group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If count >3, I only want the last three entries but if it is less than 3, I only want the last entry.

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))

# Obtain count for each name
count <- df %>%
  group_by(Name) %>%
  summarise(Count = n_distinct(Value))

# Merge dataframe with count
merge(df, count, by=c("Name"))

# Delete the first entry for x and the first entry for z

# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))
Frank
  • 66,179
  • 8
  • 96
  • 180
Anna
  • 825
  • 1
  • 11
  • 19

4 Answers4

4

Another dplyrish way:

df %>% group_by(Name) %>% slice(tail(row_number(), 
  if (n_distinct(Value) < 3) 1 else 3
))

# A tibble: 7 x 2
# Groups:   Name [3]
    Name Value
  <fctr> <dbl>
1      x     2
2      x     3
3      x     4
4      y     5
5      y     6
6      y     7
7      z     9

The analogue in data.table is...

library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]

The closest thing in base R is...

with(df, {
  len = tapply(Value, Name, FUN = length)
  nv  = tapply(Value, Name, FUN = function(x) length(unique(x)))
  df[ sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})

... which is way more difficult to come up with than it should be.

Frank
  • 66,179
  • 8
  • 96
  • 180
3

Another possibility:

library(tidyverse)

df %>%
  split(.$Name) %>%
  map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))

Which gives:

#  Name Value
#1    x     2
#2    x     3
#3    x     4
#4    y     5
#5    y     6
#6    y     7
#7    z     9
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
2

In base R, split the df by df$Name first. Then, for each subgroup, check number of rows and extract last 3 or last 1 row conditionally.

do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))

Or

do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
#    Name Value
#x.2    x     2
#x.3    x     3
#x.4    x     4
#y.5    y     5
#y.6    y     6
#y.7    y     7
#z      z     9

For three conditional values

do.call(rbind, lapply(split(df, df$Name), function(a)
      a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))
d.b
  • 32,245
  • 6
  • 36
  • 77
  • Thank you @d.b! Is there a way to extend this to split it into three different subgroups? I imagine it would have c(6,3,1) (if a count of 6 were my next break point) but I can't really figure out the NROW input – Anna Jul 31 '17 at 18:14
  • this looks good but for some reason, I'm not getting the expected output. If I have `df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z","a","a","a","a","a","a","a"), Value = c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7))` then my expected output is `data.frame(Name = c("x","x","x","y","y","y","z","a","a","a","a","a","a"), Value = c(2,3,4,5,6,7,9,2,3,4,5,6,7))` However, instead I get `data.frame(Name = c("a","a","a","x","x","x","y","y","y","z"), Value = c(5,6,7,2,3,4,5,6,7,9))` – Anna Jul 31 '17 at 18:27
  • 1
    @Anna, try `do.call(rbind, lapply(split(df, df$Name), function(a) a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))` – d.b Jul 31 '17 at 18:29
2

If you're already using dplyr, the natural approach is:

library(dplyr)

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), 
                 Value = c(1,2,3,4,5,6,7,8,9))

df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value))) %>% 
  filter((Count>= 3 & Rank <= 3) | (Rank==1)) %>%
  select(-c(Count,Rank))

There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • See `?ranking` for your ranking options -- your question didn't make it entirely clear how you would break ties. – C8H10N4O2 Jul 31 '17 at 17:56
  • Thank you @C8H10N4O2! How would I extend this to a filter by three different groups? For example, if I wanted to break by 6, 3, or 1 count, where I take the "floor" of the count. Also, the values are all unique so no need to worry about ties but I appreciate the thoroughness! – Anna Jul 31 '17 at 18:06
  • @Anna you just need to figure out how to write your selection as a logical statement and then put it inside `filter()`. Also, note that this answer will rank your data by Value, and Frank's just takes the last N rows by group. So if you are confident that your dataset is already sorted, you should accept Frank's answer. – C8H10N4O2 Jul 31 '17 at 18:35