I have a large, fairly untidy dataset that can be roughly approximated by the following code:
set.seed(1)
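# Years: 50 rows each for 1888-1891, 30 rows each for 1892-1895 (320 rows in total)
col_1 <- c(rep(1888:1891, each = 50), rep(1892:1895, each = 30))
# Garment names, each followed by the colours it comes in
a <- c('shirt', 'blue', 'red', 'green', 'pants', 'blue', 'red', 'green', 'yellow', 'sweater', 'black', 'orange', 'purple')
# Repeat the pattern and truncate it to 320 rows to match col_1
b <- rep(a, 30)
col_2 <- b[1:320]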
df <- data.frame(col_1, col_2)
Here, each colour in col_2 refers to the most recently mentioned garment above it.
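To make that relationship concrete, here is a minimal sketch on a toy vector (not the real data) that labels every row with the garment it belongs to. It assumes the set of garment names is known in advance and that the first row is a garment; the names `x`, `garments`, `grp` and `garment` are made up for illustration.

```r
# Hypothetical toy vector with the same layout as col_2
x <- c("shirt", "blue", "red", "pants", "green")

# Assumption: the garment names are known up front
garments <- c("shirt", "pants", "sweater")

# TRUE on rows that name a garment, FALSE on colour rows
is_garment <- x %in% garments

# cumsum() gives a garment row and the colour rows below it the same group number
grp <- cumsum(is_garment)

# Look up the garment name for each group (requires the first row to be a garment)
garment <- x[is_garment][grp]

data.frame(x, garment)
```

Rows where `x` and `garment` are equal are the garment headers themselves; the remaining rows are the colours of that garment.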
How would I go about extracting, for each year, the different colours that sweaters are available in?
There are a few differences between this example and the real data, however:
- The real dataset is monthly, but I am only interested in whether each colour occurs at all in a given year
- The real dataset is far less repetitive, with colours exiting and entering at random each month
- The real dataset contains roughly a dozen different "garments" per month
I have considered something as crude as extracting the next ~50 data points that follow each "sweater" occurrence, but I am not sure how to do even that. I was hoping for something cleaner, since that approach would still involve a lot of tidying up: "sweater" occurs at least 12 times per year.
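One possible alternative to the fixed-window idea, sketched in base R on the example data: label each row with its garment, keep only the colour rows under "sweater", and collect the unique colours per year. This is a sketch under assumptions, not a definitive solution: the `garments` vector is hypothetical and must list every garment name in the real data, the first row must be a garment, and a garment block that runs across a year boundary (as happens in the example) has its colours attributed to the year of each row.

```r
# Rebuild the example data so the sketch runs on its own
col_1 <- c(rep(1888:1891, each = 50), rep(1892:1895, each = 30))
a <- c('shirt', 'blue', 'red', 'green', 'pants', 'blue', 'red', 'green',
       'yellow', 'sweater', 'black', 'orange', 'purple')
col_2 <- rep(a, 30)[1:320]
df <- data.frame(col_1, col_2, stringsAsFactors = FALSE)

# Assumption: every garment name is known in advance
garments <- c("shirt", "pants", "sweater")

# Carry each garment name down to the colour rows that follow it
is_garment <- df$col_2 %in% garments
df$garment <- df$col_2[is_garment][cumsum(is_garment)]

# Keep only colour rows that belong to a sweater
sweater_colours <- df[df$garment == "sweater" & !is_garment, ]

# One character vector of distinct colours per year
colours_by_year <- lapply(split(sweater_colours$col_2, sweater_colours$col_1), unique)
colours_by_year
```

Because `unique()` is applied per year, it does not matter how many times "sweater" appears within a year or how many colours follow each occurrence.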