How to count factors frequency and organize in a new dataframe in R

Question

I have a dataframe like this:

test1 = data.frame("id" = c("FC01", "FC01", "FC22", "FC03", "FC01"),
                    "product" = c("p01", "p02", "p03", "p01", "p03"),
                    "year" = c("2018", "2017", "2015", "2018", "2016"))

I need to find the IDs that appear more than onc, bought between 2016 and 2018, and know which products they bought and which year. Is it possible to create a new dataframe that showing the ids and how many times they appear and when did this happen? Something like this:

test2 = data.frame("times" = c(3, 1), "id" = c("FC01", "FC03"),
                   "year" = c("2018, 2017, 2016", "2018"))

I used dplyr and tried to group by id and filter every id that appears more than once, but I don't know how to continue to get something like this test2. I appreciate any tips in this regard.

`%>% filter(between(year, 2016, 2018))` only works best when `year` is a numeric, not a factor with string label as you have it. So first convert `test1$year <- as.numeric(as.character(test1$year))` or whatever — smci, Feb 23 '19 at 01:54
Always keep numeric columns like `year` as numeric (unless you have a strong reason to make them factors e.g. custom ordering for plot labels). If the only reason they are factor is because you `read.csv()` without `options('stringsAsFactors'=FALSE)`, then read [this](https://stackoverflow.com/questions/5187745/imported-a-csv-dataset-to-r-but-the-values-becomes-factors) — smci, Feb 23 '19 at 02:31
Happy to help. Welcome to SO! Check out the dplyr, data.table and tidyr tutorials, these packages are incredibly powerful. — smci, Feb 25 '19 at 20:09

smci · Accepted Answer · 2019-02-23T02:33:10.100

test1$year <- as.numeric(as.character(test1$year))

test1 %>% filter(between(year,2016,2018))
      %>% group_by(id)
      %>% summarize(times = n(),
          year = toString(unique(year)))


  id    times year          
  <fct> <int> <chr>         
1 FC01      3 2018 2017 2016
2 FC03      1 2018

Notes:

Getting the times column is easy, we just use the utility function dplyr::n().
For the pasted list of (unique) string names of years, same approach as this answer. toString(...) is cleaner code than paste0(as.character(...), collapse=' ')
Note we must use unique(year) as you might have multiple entries for same year.
In order to be able to filter(between(year, 2016, 2018)), we must first fix up year to be numeric, not a factor (or at minimum, make sure the factor levels are also 2015..2018 so that directly doing as.numeric() works as intended, instead of giving 1..4

How to count factors frequency and organize in a new dataframe in R

1 Answers1