2

I have a dataset that looks something like this:

name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)

    name year
1   john 2010
2   john 2011
3   john 2012
4   alex 2011
5   alex 2012
6    tim 2010
7    tim 2011
8    tim 2012
9  ralph 2014
10 ralph 2016

I want to count the two following things in this dataset:

    1. Groups based on all years
    1. And of these groups, the number of groups with at least one non-consecutive year

As an example for 1):

# sample output for 1)

              year count
1 2010, 2011, 2012     2
2       2011, 2012     1
3       2014, 2016     1

And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:

# sample output for 2)

              year count
1       2014, 2016     1

Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? 

EDIT: For anyone using an older version of R, @Rui Barradas provided an answer for 2) - I have included it here so that there is no ambiguity when copy/pasting:

agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, function(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
stats_noob
  • 5,401
  • 4
  • 27
  • 83

3 Answers3

3

Here are base R solutions.

# 1.
agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
as.data.frame(table(agg$year))
#>               Var1 Freq
#> 1 2010, 2011, 2012    2
#> 2       2011, 2012    1
#> 3       2014, 2016    1

# 2.
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
#>         Var1 Freq
#> 1 2014, 2016    1

# final clean up
rm(agg)  

Created on 2022-12-03 with reprex v2.0.2


Edit

Answering to the comment/request,

Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?

a way is to, in each group of name, first sort the data by year. Then run the code above.

my_data <- my_data[order(my_data$name, my_data$year), ]
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
1

Here is solution with dplyr and tidyr:

library(dplyr)
library(tidyr)

### 1.
my_data %>% 
  group_by(name) %>%  
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year             count
<chr>            <int>
1 2010, 2011, 2012     2
2 2011, 2012           1
3 2014, 2016           1

### 2. 
my_data %>% 
  group_by(name) %>% 
  mutate(x = lead(year) - year) %>% 
  fill(x, .direction = "down") %>% 
  ungroup () %>% 
  filter(x >= max(x)) %>% 
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year       count
<chr>      <int>
1 2014, 2016     1
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • 1
    Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? Thank you so much – stats_noob Dec 03 '22 at 08:04
  • 1
    We could use arrange after group_by like: `my_data %>% group_by(name) %>% arrange(year, .by_group = TRUE) %>% mutate(year = toString(year)) %>% distinct(year) %>% ungroup() %>% count(year, name="count")` – TarJae Dec 03 '22 at 08:20
1

Using toString and sort in by.

by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Order doesn't matter:

set.seed(42)
my_data <-  my_data[sample(nrow(my_data)), ]
by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Data:

my_data <- structure(list(name = c("john", "john", "john", "alex", "alex", 
"tim", "tim", "tim", "ralph", "ralph"), year = c(2010, 2011, 
2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)), class = "data.frame", row.names = c(NA, 
-10L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110