1

I have a fairly untidy, large dataset that can be roughly approximated by the following code.

set.seed(1)
col_1 <- c(rep(c(1888:1891), each = 50), rep(c(1892:1895), each = 30))

a <- c('shirt', 'blue', 'red', 'green', 'pants', 'blue', 'red', 'green', 'yellow', 'sweater', 'black', 'orange', 'purple')
b <- rep(a, 30)
col_2 <- b[c(1:320)]

df <- data.frame(col_1, col_2)

Wherein each colour refers to the colour of the last mentioned garment of clothing.

My question to you is how would I go about extracting, on a yearly basis, the different colours that sweaters are available in?

There are a couple of differences with the real data however:

  • The real dataset is monthly, however I am only interested in whether or not each colour occurs per year
  • The real dataset is far less repetitive, with colours exiting and entering at random each month
  • The real dataset contains roughly a dozen different "garments" per month.

I have thought to try something as crude as simply extracting the next ~50 datapoints that follow each "Sweater" occurrence, but I am not even sure how to do this, and was hoping for something cleaner since that would still involve a lot of tidying up, since "Sweater" would occur at least 12 times per year.

zephryl
  • 14,633
  • 3
  • 11
  • 30
Ben Smith
  • 11
  • 1
  • Is that sample really what your data set is like? What's the first column supposed to mean? Why are color and garment type in the same column? Is this sample supposed to represent a year or month or week? Please take the tour: https://stackoverflow.com/tour and edit your question according to these guidelines: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. – John Polo Feb 08 '23 at 01:52
  • I suggest that you provide a sample of the real data. Your rough approximation is either too rough or too untidy to tidy up. – Edward Feb 08 '23 at 02:20

1 Answers1

0

First pull all the garments out into their own column:

library(dplyr)

garments <- c("shirt", "pants", "sweater")

df2 <- df %>%
  group_by(garment = cumsum(col_2 %in% garments)) %>%
  mutate(garment = first(col_2)) %>%
  ungroup() %>%
  filter(!(col_2 %in% garment))
  
df2
# A tibble: 246 × 3
   col_1 col_2  garment
   <int> <chr>  <chr>  
 1  1888 blue   shirt  
 2  1888 red    shirt  
 3  1888 green  shirt  
 4  1888 blue   pants  
 5  1888 red    pants  
 6  1888 green  pants  
 7  1888 yellow pants  
 8  1888 black  sweater
 9  1888 orange sweater
10  1888 purple sweater
# … with 236 more rows

Then you can filter to distinct values per year for a particular garment, like sweaters:


df2 %>%
  filter(garment == "sweater") %>%
  distinct(col_1, col_2)
# A tibble: 24 × 2
   col_1 col_2 
   <int> <chr> 
 1  1888 black 
 2  1888 orange
 3  1888 purple
 4  1889 orange
 5  1889 purple
 6  1889 black 
 7  1890 black 
 8  1890 orange
 9  1890 purple
10  1891 black 
# … with 14 more rows
zephryl
  • 14,633
  • 3
  • 11
  • 30