The main dplyr verbs (e.g., `mutate()`, `arrange()`, etc.) always return dataframes. So if you are looking to do some kind of operation that results in a dataframe, you are correct that a dplyr-centric approach is probably a good place to start. Base R functions are often vector-centric, so something like `table()` will often require additional steps afterward if you want a dataframe in the end.
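To illustrate that extra step, here is a sketch of the base R route (the toy `df` and `composerName` values are hypothetical stand-ins for the data in the question):

```r
# toy data frame standing in for the df in the question (hypothetical values)
df <- data.frame(composerName = c("Bach", "Bach", "Mozart"))

# table() returns a named table object, not a dataframe,
# so an extra as.data.frame() step is needed to get one
counts_df <- as.data.frame(table(df$composerName))
names(counts_df) <- c("composerName", "n")    # rename Var1/Freq to match dplyr-style output
counts_df <- counts_df[order(-counts_df$n), ] # highest count first
```

It works, but the renaming and reordering are exactly the kind of bookkeeping the dplyr options below avoid.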
Once you've committed to dplyr, you have at least two options for this particular dilemma:
Option 1
The `count()` function gets you there in one step.
```r
df %>%
  count(composerName) %>%
  arrange(-n) # to bring the highest count to the top
```
Option 2
Although it is one line longer, I personally prefer the more verbose option because it makes it easier to see what is happening.
```r
df %>%
  group_by(composerName) %>%
  summarise(n = n()) %>%
  arrange(-n) # to bring the highest count to the top
```
It has the added benefit that I can roll right into additional `summarise()` computations that I might care about too.
```r
df %>%
  group_by(composerName) %>%
  summarise(
    n = n(),
    n_sq = n^2 # a little silly here, but often convenient in other contexts
  )
```
Consider data.table for large datasets
EDIT: I would be remiss if I failed to mention that data.table might be worth looking into for this larger dataset. Although dplyr is optimized for readability, it often slows down on datasets with more than 100k rows. In contrast, the data.table package is designed for speed with large datasets. If you are an R-focused person who often runs into large datasets, it's worth the time to look into. Here is a good comparison
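For reference, a rough data.table equivalent of the count above (a sketch; the toy `df` and `composerName` values are hypothetical stand-ins for the data in the question):

```r
library(data.table)

# toy data frame standing in for the df in the question (hypothetical values)
df <- data.frame(composerName = c("Bach", "Bach", "Mozart"))

dt <- as.data.table(df)                           # one-time conversion to a data.table
result <- dt[, .N, by = composerName][order(-N)]  # .N is the per-group row count
```

The `dt[i, j, by]` syntax is terser than the dplyr pipeline, which is part of the readability trade-off mentioned above.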