How to get the top most number of distinct values from a dataset

Question

I am playing around with the Los Angeles Police data that I got from the Office of the Mayor's website. For 2017-2018, I am trying to see which charges were given out in Council District 5 and how many times each one was given. CHARGE and CITY_COUNCIL_DIST are the two variables/columns I am looking at.

I used table(ArrestData$CHARGE) to count how often each distinct value occurs.

I realized that there are over 2400 unique entries, so most of them are omitted from the printed output. Is there code to see which 5 charges the LAPD gives out most often?

Additionally, I am trying to find the top 5 charges in one specific Council District (again, another variable/column). Is there code for this?

Aside: How can I add sample data to my post? What are the steps to do so in RStudio? Someone asked me to do this in a previous post, but I am not sure how. They told me to use dput(head(df,n)), but the output is too large, even with only 10 rows. They also told me to do it through RScript, but I am not sure what they mean.

Here's a post for future reference on how to add data: [Reproducibility](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — NelsonGon, Jul 31 '19 at 16:59

score 0 · Answer 1 · answered Jul 31 '19 at 17:03

I think that using an aggregate function may help here. If your data is just CHARGE and CITY_COUNCIL_DIST, then the code might look something like this:

# add a helper column of ones, then sum it for each district/charge pair
agg.data <- aggregate(n ~ CITY_COUNCIL_DIST + CHARGE, data = transform(ArrestData, n = 1), FUN = sum)

I'm not terribly advanced at R yet, so that code might need some tweaks with your actual data. Once you have the aggregate, you can order it by the count, largest first:

agg.data[order(agg.data$n, decreasing = TRUE), ]
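As an alternative that stays with the table() call from the question, here is a minimal base R sketch. The small data frame is made up for illustration and only assumes the two column names from the question; the real ArrestData would be used in its place.

```r
# Toy stand-in for the real ArrestData (hypothetical values, same column names)
ArrestData <- data.frame(
  CHARGE = c("A", "A", "A", "B", "B", "C", "A", "A", "B", "C"),
  CITY_COUNCIL_DIST = c(5, 5, 5, 5, 5, 5, 1, 1, 1, 1)
)

# Top charges overall: count each value, sort largest first, keep the first 5
overall <- head(sort(table(ArrestData$CHARGE), decreasing = TRUE), 5)

# Top charges in one district: subset the rows first, then count the same way
dist5 <- ArrestData[ArrestData$CITY_COUNCIL_DIST == 5, ]
top_dist5 <- head(sort(table(dist5$CHARGE), decreasing = TRUE), 5)

print(overall)
print(top_dist5)
```

Because head() only keeps the first 5 entries of the sorted table, the printed output stays short even when there are thousands of distinct charges.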

I'm really no help with dput, sorry!

Litmon

Also, I recently posted a question and was confused about dput. Here's the thread I found and followed: https://stackoverflow.com/questions/49994249/example-of-using-dput. – Litmon Jul 31 '19 at 17:09
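On the dput question, one way to keep the output small is to select only the columns the question is about before calling dput(). This is a rough sketch under assumptions: the data frame below, including its NOTES column, is a hypothetical stand-in for the real, much wider ArrestData.

```r
# Hypothetical stand-in for the real data, with one extra wide column
ArrestData <- data.frame(
  CHARGE = c("CHARGEA", "CHARGEB", "CHARGEA", "CHARGEC"),
  CITY_COUNCIL_DIST = c(5, 5, 0, 5),
  NOTES = c("long text 1", "long text 2", "long text 3", "long text 4"),
  stringsAsFactors = FALSE
)

# Keep only the two relevant columns and the first few rows
sample_rows <- head(ArrestData[, c("CHARGE", "CITY_COUNCIL_DIST")], 3)

# dput() prints R code that recreates the object; that printed text
# is what gets pasted into the post
dput(sample_rows)
```

Anyone reading the post can then assign the pasted structure(...) text to a variable and get the same small data frame back.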

chadNoliver · Answer 2 · 2019-07-31T19:19:02.173

Posting a link to the actual dataset, or some sample data, would make it easier to build a solution and would help the post meet the reproducibility standards that others have mentioned. For the sake of this example, we will explicitly create a dataset.

ArrestData <- data.frame(
  # 18 rows of CHARGEA, 16 of CHARGEB, ... down to 2 of CHARGEI (90 rows total)
  CHARGE = rep(paste0("CHARGE", LETTERS[1:9]), times = seq(18, 2, by = -2)),
  # the two values are recycled down the 90 rows: 0, 5, 0, 5, ...
  CITY_COUNCIL_DIST = c(0, 5)
)

This code should work, assuming that your dataset is named ArrestData and your CHARGE and CITY_COUNCIL_DIST columns are named as stated. The code below returns the top 5 CHARGE values within each CITY_COUNCIL_DIST.

# install these packages if you do not have them (only needed once)
install.packages("magrittr")
install.packages("dplyr")

# load the libraries
library(magrittr)
library(dplyr)

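ArrestData %>% 
  # count the rows for each CHARGE / CITY_COUNCIL_DIST pair
  group_by(CHARGE, CITY_COUNCIL_DIST) %>%
  summarize(count=n()) %>% 
  # sort by district, most frequent charge first
  arrange(CITY_COUNCIL_DIST, desc(count)) %>%
  # rank the charges within each district; tied counts share the lower rank
  group_by(CITY_COUNCIL_DIST) %>% 
  mutate(rank = rank(desc(count), ties.method="min")) %>% 
  # keep the top 5 ranks per district
  filter(rank<=5)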

To keep only the results for CITY_COUNCIL_DIST 5, change the filter statement to something like the following (depending on what your CITY_COUNCIL_DIST values actually are):

filter(rank<=5, CITY_COUNCIL_DIST==5)
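For a single district, a shorter variant is to filter first and then count. The sketch below is one possible way to write it, not the only one: it rebuilds data with the same shape as the example dataset and uses count() and slice_max(), which assumes dplyr 1.0.0 or later. The name top5_dist5 is just an illustrative choice.

```r
library(dplyr)

# Same shape as the example dataset: 18 rows of CHARGEA down to 2 of CHARGEI,
# with districts alternating 0, 5
ArrestData <- data.frame(
  CHARGE = rep(paste0("CHARGE", LETTERS[1:9]), times = seq(18, 2, by = -2)),
  CITY_COUNCIL_DIST = rep(c(0, 5), length.out = 90),
  stringsAsFactors = FALSE
)

top5_dist5 <- ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%   # keep one district only
  count(CHARGE, name = "count") %>%    # one row per charge with its frequency
  slice_max(count, n = 5)              # 5 largest counts; ties are kept by default

print(top5_dist5)
```

Filtering before counting means only the rows for the district of interest are tallied, which avoids ranking every district when only one is needed.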
