I recently needed to write R code to identify the rows containing district that were visited across all years. There are three main columns to consider:
year
original_district
The name of the district at the time of data collectioncurrent_district
The current names of districts that inhabit the same area originally surveyed
I need my code to be as general as possible so that it can handle any number of years, and it needs to identify rows with any rows with a match between any combination of original_district
and current_district
.
This doesn't sound TOO complicated, but there is an added complication: If a district was split in the time between data collection and now, then the current_district
may contain more than one value. For instance, original_district
may equal Mountain District
, while current_district
might equal "Northern Mountain District;Southern Mountain District"
I am including a solution to identifying the rows with repeating values below. If you can help me identify the exact values that occur across all groups, that would very helpful.
To make the problem a bit more accessible, I created an example using comedy shows attended over multile years, the headliners, and a list of openers.
library(tidyverse)
# Set the random seed for predictability
set.seed(2023)
# Define comics that are 100% visited as a headliner each year.
# Mitch Hedberg is removed after 2005.
always_comics <-
c("Norm MacDonald", "Maria Bamford", "Mitch Hedberg")
# Define the other comics that might be seen
other_comics <-
c("Dave Attell", "John Mulaney", "Tig Notaro", "Eddie Izzard",
"Doug Stanhope", "Jim Jefferies", "Gilbert Gottfried", "Patrice O'Neal",
"Chelsea Peretti", "Jim Gaffigan", "Amy Schumer", "Eddie Murphy", "Sarah Silverman",
"Bill Burr", "Shane Gillis", "Mark Normand", "tony hinchcliffe", "Ellen DeGeneres",
"Nicole Byer", "Michelle Wolf", "Jenny Slate", "Sarah Sherman", "Katherine Ryan") %>%
str_to_title() %>%
unique() %>%
sort()
# Initialize vectors for years and headliners
years <- numeric(0)
headliners <- character(0)
years_of_shows <-
seq(2001, 2019, 4)
for(i in 1:length(years_of_shows)){
# We always see "Norm MacDonald", "Maria Bamford", "Mitch Hedberg" every year,
# but determine a random number of additional shows for each year.
n_shows = sample(2:4, size = 1)
# Concatenate the correct number of years to the years vector
years <- c(years, rep(years_of_shows[i], (3 + n_shows)))
# Concatenate the headliners
headliners <-
c(
headliners,
c(always_comics, sample(other_comics, n_shows)) %>% sample(size = n_shows + 3)
)
}
# Create a dataframe of the headliners and the years attended.
comedy_shows_df <-
tibble(
year = years,
headliner = headliners,
openers = NA_character_
)
# Now add openers. These can be any comic.
all_comics <-
c(always_comics, other_comics) %>% unique()
for(i in 1:nrow(comedy_shows_df)){
# Choose a random number of openers
n_openers = sample(3:5, size = 1, prob = c(1,3,2))
# Sample the openers
openers = sample(all_comics, size = n_openers)
# Make so Mitch headberg always performs alone.
if("Mitch Hedberg" %in% comedy_shows_df$headliner[i]){
openers = "Mitch Hedberg"
}
# Mitch Hedberg died in 2005, so remove him from latter years
if("Mitch Hedberg" %in% openers & comedy_shows_df$year[i] > 2005){
openers = str_replace_all(openers, "Mitch Hedberg", "Norm MacDonald")
}
# The above two IF statement ensure that Mitch Hedberg shouldn't end up in the
# final result.
# If the headliner shows up as an opener, make that person the only opener
# (i.e. they performed alone)
if(comedy_shows_df$headliner[i] %in% openers){
openers = comedy_shows_df$headliner[i]
}
# Save the list of openers as a semi-colon separated string
comedy_shows_df$openers[i] <-
openers %>%
unique() %>%
paste(collapse = ";")
}
# Mitch Hedberg died in 2005, so remove him from latter years
comedy_shows_df %<>%
filter(
!(str_detect(headliner, "Hedberg") & years > 2005)
)
comedy_shows_df %<>%
mutate(
show_id = 1:nrow(.),
.before = 1
)
comedy_shows_df
Okay, now I have a dataset that you can recreate. Next, let check my solution, or post your own!