0

I recently needed to write R code to identify the rows containing district that were visited across all years. There are three main columns to consider:

  • year
  • original_district The name of the district at the time of data collection
  • current_district The current names of districts that inhabit the same area originally surveyed

I need my code to be as general as possible so that it can handle any number of years, and it needs to identify rows with any rows with a match between any combination of original_district and current_district.

This doesn't sound TOO complicated, but there is an added complication: If a district was split in the time between data collection and now, then the current_district may contain more than one value. For instance, original_district may equal Mountain District, while current_district might equal "Northern Mountain District;Southern Mountain District"

I am including a solution to identifying the rows with repeating values below. If you can help me identify the exact values that occur across all groups, that would very helpful.


To make the problem a bit more accessible, I created an example using comedy shows attended over multile years, the headliners, and a list of openers.

library(tidyverse)
# Set the random seed for predictability
set.seed(2023)

# Define comics that are 100% visited as a headliner each year.
# Mitch Hedberg is removed after 2005.
always_comics <- 
  c("Norm MacDonald", "Maria Bamford", "Mitch Hedberg")

# Define the other comics that might be seen
other_comics <- 
  c("Dave Attell", "John Mulaney", "Tig Notaro", "Eddie Izzard", 
    "Doug Stanhope", "Jim Jefferies", "Gilbert Gottfried", "Patrice O'Neal", 
    "Chelsea Peretti", "Jim Gaffigan", "Amy Schumer", "Eddie Murphy", "Sarah Silverman",
    "Bill Burr", "Shane Gillis", "Mark Normand", "tony hinchcliffe", "Ellen DeGeneres",
    "Nicole Byer", "Michelle Wolf", "Jenny Slate", "Sarah Sherman", "Katherine Ryan") %>% 
  str_to_title() %>%
  unique() %>%
  sort()

# Initialize vectors for years and headliners
years <- numeric(0)
headliners <- character(0)

years_of_shows <- 
  seq(2001, 2019, 4)

for(i in 1:length(years_of_shows)){
  # We always see "Norm MacDonald", "Maria Bamford", "Mitch Hedberg" every year, 
  # but determine a random number of additional shows for each year.
  n_shows = sample(2:4, size = 1)
  
  # Concatenate the correct number of years to the years vector
  years <- c(years, rep(years_of_shows[i], (3 + n_shows)))
  
  # Concatenate the headliners
  headliners <- 
    c(
      headliners, 
      c(always_comics, sample(other_comics, n_shows)) %>% sample(size = n_shows + 3)
    )
}

# Create a dataframe of the headliners and the years attended.
comedy_shows_df <-
  tibble(
    year = years,
    headliner = headliners,
    openers = NA_character_
  )


# Now add openers.  These can be any comic.
all_comics <-
  c(always_comics, other_comics) %>% unique()

for(i in 1:nrow(comedy_shows_df)){
  # Choose a random number of openers
  n_openers = sample(3:5, size = 1, prob = c(1,3,2))
  # Sample the openers
  openers = sample(all_comics, size = n_openers)
  
  # Make so Mitch headberg always performs alone.
  if("Mitch Hedberg" %in% comedy_shows_df$headliner[i]){
    openers = "Mitch Hedberg"
  }
  # Mitch Hedberg died in 2005, so remove him from latter years
  if("Mitch Hedberg" %in% openers & comedy_shows_df$year[i] > 2005){
    openers = str_replace_all(openers, "Mitch Hedberg", "Norm MacDonald")
  }
  # The above two IF statement ensure that Mitch Hedberg shouldn't end up in the 
  # final result.
  
  # If the headliner shows up as an opener, make that person the only opener 
  # (i.e. they performed alone)
  if(comedy_shows_df$headliner[i] %in% openers){
    openers = comedy_shows_df$headliner[i]
  }
  
  # Save the list of openers as a semi-colon separated string
  comedy_shows_df$openers[i] <- 
    openers %>%
    unique() %>%
    paste(collapse = ";")
}

# Mitch Hedberg died in 2005, so remove him from latter years
comedy_shows_df %<>%
  filter(
    !(str_detect(headliner, "Hedberg") & years > 2005)
  )


comedy_shows_df %<>%
  mutate(
    show_id = 1:nrow(.),
    .before = 1
  )

comedy_shows_df

Okay, now I have a dataset that you can recreate. Next, let check my solution, or post your own!

Lewkrr
  • 414
  • 4
  • 13
  • 3
    Can you post some code to generate a small example dataset? – DuckPyjamas Mar 31 '23 at 19:34
  • 2
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 31 '23 at 20:40
  • I've posted a solution, along with code to create a toy dataset to work on. – Lewkrr Apr 01 '23 at 21:06

1 Answers1

0

Solution

My solution involves splitting the semicolon-separated list string, unnesting by that column, pivoting long, and then reducing the result.

library(tidyverse)
comedy_shows_df_long <-
  comedy_shows_df %>%
  # First, split the semi-colon separated list
  mutate(
    openers =
      str_split(openers, pattern = ";")
  ) %>% 
  # Next, unnest the split column
  unnest(openers) %>%
  # After that, since headliner and opener represent 
  # the same thing (a comic), stack the columns using `pivot_longer`
  pivot_longer(
    cols = c(headliner, openers)
  ) %>% 
  rename(
    comedian = value
  )
comedy_shows_df_long
# Next, split the dataframe by comedian and year, then intersect ALL of the 
# results using the `Reduce` function
# This results in the names of the comedians seen every single year.
comics_seen_every_year <-
  Reduce(
    f = intersect, 
    x = 
      split(
        x = comedy_shows_df_long$comedian, 
        f = comedy_shows_df_long$year
      )
  ) %>%
  sort()

comics_seen_every_year

Next, identify the unique shows where these comedians performed. This will show you the show_id, and the repeating comics that performed in that show as either a headliner OR opener.

repeated_comics_with_show_id_df <-
  comedy_shows_df_long %>%
  filter(
    comedian %in% comics_seen_every_year
  ) %>%
  group_by(
    show_id, year
  ) %>%
  summarise(
    comedians_seen = paste(unique(sort(comedian)), collapse = ";")
  ) %>% 
  ungroup() %>% 
  suppressMessages()
repeated_comics_with_show_id_df
filtered_comedy_shows_df <-
  semi_join(
    comedy_shows_df,
    repeated_comics_with_show_id_df,
    by = "show_id"
  )
filtered_comedy_shows_df

Here is a screengrab of the original data frame, comedy_shows_df.

enter image description here

Here is a copy of the reduced result, showing the repeated performers at each show, repeated_comics_with_show_id_df.

enter image description here

And here is the final result, the original data frame filtered to only contain rows which contain a repeating comic, filtered_comedy_shows_df

enter image description here

Lewkrr
  • 414
  • 4
  • 13