How to create a table showing the biggest values in a large csv file in R with dplyr?

Question

I have a very large CSV file and I'm trying to count how many times each value is repeated in a column. The CSV file I'm using: https://www.kaggle.com/nyphil/perf-history

This is what I've been trying to do.

library(dplyr)
repeatedcomposers<-table(ny_philarmonic$composerName)

This works, but it only gives me 1000 values instead of the 2767 composers in the dataframe. I also need the result as a separate dataframe so I can use it later.

Try `ny_philarmonic %>% count(composerName)` – MrFlick, Jun 04 '21 at 00:31

Ian Cero · Answer 1 · 2021-06-04T02:51:36.967

The main dplyr verbs (e.g., mutate(), arrange(), etc.) always return dataframes. So if you are looking to do some kind of operation that results in a dataframe, you are correct that a dplyr-centric approach is probably a good place to start. Base R functions are often vector-centric, so something like table() will often require additional steps afterward if you want a dataframe in the end.

Once you've committed to dplyr, you have at least two options for this particular dilemma:
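
As a minimal sketch of those additional steps, assuming a small made-up data frame `df` standing in for `ny_philarmonic` (the composer names and counts below are invented for illustration), the output of table() can be converted with as.data.frame():

```r
# Toy stand-in for ny_philarmonic; the values are invented for illustration
df <- data.frame(
  composerName = c("Beethoven", "Mozart", "Beethoven",
                   "Brahms", "Beethoven", "Mozart"),
  stringsAsFactors = FALSE
)

# table() returns a 'table' object, not a dataframe
tab <- table(df$composerName)

# Extra step: convert to a dataframe; responseName names the count column
composer_counts <- as.data.frame(tab, responseName = "n",
                                 stringsAsFactors = FALSE)
names(composer_counts)[1] <- "composerName"  # the default name is Var1

# Sort so the highest count comes first
composer_counts <- composer_counts[order(-composer_counts$n), ]
```

This works without dplyr, but it takes several steps to get what count() returns in one.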

Option 1

The count() function gets you there in one step.

df %>% 
  count(composerName) %>%
  arrange(-n) # to bring the highest count to the top
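
If only the biggest counts are needed, one possible variation uses the `sort` argument of count() together with slice_max(). This is a sketch that assumes dplyr 1.0.0 or later and a made-up `df` in place of the real data; assigning the result to a name gives a separate dataframe to reuse later.

```r
library(dplyr)

# Toy stand-in for the real data; the values are invented for illustration
df <- data.frame(
  composerName = c("Beethoven", "Mozart", "Beethoven",
                   "Brahms", "Beethoven", "Mozart")
)

# sort = TRUE orders the result by descending n
composer_counts <- df %>%
  count(composerName, sort = TRUE)

# Keep the rows with the 2 largest counts (ties are included by default)
top_composers <- composer_counts %>%
  slice_max(order_by = n, n = 2)
```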

Option 2

Although it is one more line, I personally prefer the more verbose option because it helps me see what is happening more easily.

df %>% 
  group_by(composerName) %>% 
  summarise(n = n()) %>%
  arrange(-n) # to bring the highest count to the top

It has the added benefit that I can roll right into additional summarize() commands that I might care about too.

df %>% 
  group_by(composerName) %>% 
  summarise(
    n = n(), 
    n_sq = n^2) # a little silly here, but often convenient in other contexts

Consider data.table for large datasets

EDIT: I would be remiss if I failed to mention that data.table might be worth looking into for this larger dataset. Although dplyr is optimized for readability, it often slows down on datasets with more than 100k rows. In contrast, the data.table package is designed for speed with large datasets. If you are an R-focused person who often runs into large datasets, it's worth the time to look into. Here is a good comparison
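
For reference, a rough data.table equivalent of the counting step might look like the sketch below. It assumes the data.table package is installed and uses a made-up `dt` rather than the real file.

```r
library(data.table)

# Toy stand-in for the real data; the values are invented for illustration
dt <- data.table(
  composerName = c("Beethoven", "Mozart", "Beethoven",
                   "Brahms", "Beethoven", "Mozart")
)

# .N is the number of rows in each group; the chained [order(-N)]
# sorts the result so the highest count comes first
composer_counts <- dt[, .N, by = composerName][order(-N)]
```

For the real CSV, data.table's fread() could be used to read the file; the path depends on where the download was saved.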
