Counting Number of Unique Column Values Per Group

Question

I have a dataset that looks something like this:

name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)

    name year
1   john 2010
2   john 2011
3   john 2012
4   alex 2011
5   alex 2012
6    tim 2010
7    tim 2011
8    tim 2012
9  ralph 2014
10 ralph 2016

I want to count the two following things in this dataset:

1. Groups based on all years
1. And of these groups, the number of groups with at least one non-consecutive year

As an example for 1):

# sample output for 1)

              year count
1 2010, 2011, 2012     2
2       2011, 2012     1
3       2014, 2016     1

And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:

# sample output for 2)

              year count
1       2014, 2016     1

Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?

EDIT: For anyone using an older version of R, @Rui Barradas provided an answer for 2) - I have included it here so that there is no ambiguity when copy/pasting:

agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, function(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))

Rui Barradas · Accepted Answer · 2022-12-03T11:59:22.350

3

Here are base R solutions.

# 1.
agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
as.data.frame(table(agg$year))
#>               Var1 Freq
#> 1 2010, 2011, 2012    2
#> 2       2011, 2012    1
#> 3       2014, 2016    1

# 2.
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
#>         Var1 Freq
#> 1 2014, 2016    1

# final clean up
rm(agg)

^{Created on 2022-12-03 with reprex v2.0.2}

Edit

Answering to the comment/request,

Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?

a way is to, in each group of name, first sort the data by year. Then run the code above.

my_data <- my_data[order(my_data$name, my_data$year), ]

edited Dec 03 '22 at 11:59

answered Dec 03 '22 at 07:19

Rui Barradas

70,273
8
34
66

@ Rui : thank you so much! Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? Thank you so much! – stats_noob Dec 03 '22 at 08:03
Thank you so much! I can't run the following line of code: agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))] – stats_noob Dec 03 '22 at 17:20
Error: unexpected input .... " ... /" – stats_noob Dec 03 '22 at 17:20
I think I am using an older version of R in which function do not accept "/"? – stats_noob Dec 03 '22 at 17:21
1

@stats_noob Replace `\(y)` by `function(y)`, the backslash is the new notation for the anonymous functions introduced in R4.1. Time to update R? – Rui Barradas Dec 03 '22 at 20:14

score 1 · Answer 2 · answered Dec 03 '22 at 07:38

1

Here is solution with dplyr and tidyr:

library(dplyr)
library(tidyr)

### 1.
my_data %>% 
  group_by(name) %>%  
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year             count
<chr>            <int>
1 2010, 2011, 2012     2
2 2011, 2012           1
3 2014, 2016           1

### 2. 
my_data %>% 
  group_by(name) %>% 
  mutate(x = lead(year) - year) %>% 
  fill(x, .direction = "down") %>% 
  ungroup () %>% 
  filter(x >= max(x)) %>% 
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year       count
<chr>      <int>
1 2014, 2016     1

answered Dec 03 '22 at 07:38

TarJae

72,363
6
19
66

1

Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? Thank you so much – stats_noob Dec 03 '22 at 08:04
1

We could use arrange after group_by like: `my_data %>% group_by(name) %>% arrange(year, .by_group = TRUE) %>% mutate(year = toString(year)) %>% distinct(year) %>% ungroup() %>% count(year, name="count")` – TarJae Dec 03 '22 at 08:20

score 1 · Answer 3 · answered Dec 03 '22 at 08:25

Using toString and sort in by.

by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Order doesn't matter:

set.seed(42)
my_data <-  my_data[sample(nrow(my_data)), ]
by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Data:

my_data <- structure(list(name = c("john", "john", "john", "alex", "alex", 
"tim", "tim", "tim", "ralph", "ralph"), year = c(2010, 2011, 
2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)), class = "data.frame", row.names = c(NA, 
-10L))

Counting Number of Unique Column Values Per Group

3 Answers3

Edit

Linked