0

I have a dataframe like this:

test1 = data.frame("id" = c("FC01", "FC01", "FC22", "FC03", "FC01"),
                    "product" = c("p01", "p02", "p03", "p01", "p03"),
                    "year" = c("2018", "2017", "2015", "2018", "2016"))

I need to find the IDs that appear more than onc, bought between 2016 and 2018, and know which products they bought and which year. Is it possible to create a new dataframe that showing the ids and how many times they appear and when did this happen? Something like this:

test2 = data.frame("times" = c(3, 1), "id" = c("FC01", "FC03"),
                   "year" = c("2018, 2017, 2016", "2018"))

I used dplyr and tried to group by id and filter every id that appears more than once, but I don't know how to continue to get something like this test2. I appreciate any tips in this regard.

smci
  • 32,567
  • 20
  • 113
  • 146
Falves
  • 37
  • 1
  • 8
  • 1
    `%>% filter(between(year, 2016, 2018))` only works best when `year` is a numeric, not a factor with string label as you have it. So first convert `test1$year <- as.numeric(as.character(test1$year))` or whatever – smci Feb 23 '19 at 01:54
  • 1
    Always keep numeric columns like `year` as numeric (unless you have a strong reason to make them factors e.g. custom ordering for plot labels). If the only reason they are factor is because you `read.csv()` without `options('stringsAsFactors'=FALSE)`, then read [this](https://stackoverflow.com/questions/5187745/imported-a-csv-dataset-to-r-but-the-values-becomes-factors) – smci Feb 23 '19 at 02:31
  • This is a golden tip, @smci! Thanks you so much! – Falves Feb 25 '19 at 16:24
  • Happy to help. Welcome to SO! Check out the dplyr, data.table and tidyr tutorials, these packages are incredibly powerful. – smci Feb 25 '19 at 20:09

1 Answers1

1
test1$year <- as.numeric(as.character(test1$year))

test1 %>% filter(between(year,2016,2018))
      %>% group_by(id)
      %>% summarize(times = n(),
          year = toString(unique(year)))


  id    times year          
  <fct> <int> <chr>         
1 FC01      3 2018 2017 2016
2 FC03      1 2018  

Notes:

  • Getting the times column is easy, we just use the utility function dplyr::n().
  • For the pasted list of (unique) string names of years, same approach as this answer. toString(...) is cleaner code than paste0(as.character(...), collapse=' ')
  • Note we must use unique(year) as you might have multiple entries for same year.
  • In order to be able to filter(between(year, 2016, 2018)), we must first fix up year to be numeric, not a factor (or at minimum, make sure the factor levels are also 2015..2018 so that directly doing as.numeric() works as intended, instead of giving 1..4
smci
  • 32,567
  • 20
  • 113
  • 146