How to find totals of different categories in same dataset?

Question

I'm a student doing exploratory analysis and data visualization with this hate crime data set. I am trying to create a matrix of counts for the different categories (i.e., race, religion, etc.) in my dataset (hate_crime) for 2009 and 2017. The full dataset can be found here.

I extracted the necessary data (incidents during 2009 or 2017) from the existing data.

SecondYear_OTYear <- hate_crime %>% filter(hate_crime$DATA_YEAR == "2017" | hate_crime$DATA_YEAR == "2009")

Then, I just made different subsets for each subcategory in the category. For example, to create subsets of bias descriptions I made the following:

antiWhiteSubset <- SecondYear_OTYear[grep("Anti-White", SecondYear_OTYear$BIAS_DESC), ]
antiWhite17 <- nrow(antiWhiteSubset[antiWhiteSubset$DATA_YEAR == "2017", ])
antiWhite09 <- nrow(antiWhiteSubset[antiWhiteSubset$DATA_YEAR == "2009", ])

antiBlackSubset <- SecondYear_OTYear[grep("Anti-Black", SecondYear_OTYear$BIAS_DESC), ]
antiBlack17 <- nrow(antiBlackSubset[antiBlackSubset$DATA_YEAR == "2017", ])
antiBlack09 <- nrow(antiBlackSubset[antiBlackSubset$DATA_YEAR == "2009", ])

antiLatinoSubset <- SecondYear_OTYear[grep("Anti-Hispanic", SecondYear_OTYear$BIAS_DESC), ]
antiLatino17 <- nrow(antiLatinoSubset[antiLatinoSubset$DATA_YEAR == "2017", ])
antiLatino09 <- nrow(antiLatinoSubset[antiLatinoSubset$DATA_YEAR == "2009", ])

I then did the same for all of the other bias descriptions. Finally, I created a matrix of the totals to use for bar plots, mosaic plots, or chi-square analyses, such as the following:

[Figure: bar plot of hate crime incidents by bias description]

However, I feel like there is a more efficient way to code for the different subsets... I'm open to any suggestions! Thank you so much.

Can you provide a reproducible example of your dataset? The link you provided requires registration before the data can be downloaded. See: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — dc37, Jan 07 '20 at 20:14
and after `table` you can then do `barplot(table(...), beside=TRUE)` — user20650, Jan 07 '20 at 20:24
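
The `table()` suggestion in the comment above can be sketched on a small made-up data frame. The rows below are invented for illustration and are not taken from the hate crime data set; `toy` and `counts` are hypothetical names.

```r
# Toy data: invented rows, not the real data set
toy <- data.frame(
    DATA_YEAR = c(2009, 2009, 2009, 2017, 2017, 2017),
    BIAS_DESC = c("Anti-White", "Anti-Jewish", "Anti-White",
                  "Anti-Jewish", "Anti-Jewish", "Anti-White")
)

# Cross-tabulate year (rows) by bias description (columns)
counts <- table(toy$DATA_YEAR, toy$BIAS_DESC)
counts

# beside = TRUE draws one group of bars per bias, one bar per year
barplot(counts, beside = TRUE, legend.text = rownames(counts))
```

Note that `table()` counts whole `BIAS_DESC` values, so a compound description would show up as its own category rather than being added to each of its component biases.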

score 1 · Answer 1 · answered Jan 07 '20 at 20:38

You can use dplyr to filter the data and ggplot2::geom_bar to summarize counts.

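library(dplyr)
library(ggplot2)

# Keep the two years of interest, then the five most frequent bias descriptions
hc_small = hate_crime %>% filter(DATA_YEAR %in% c(2009, 2017))
top_5 = hc_small %>% count(BIAS_DESC, sort=TRUE) %>% pull(BIAS_DESC) %>% head(5)
hc_5 = hc_small %>% filter(BIAS_DESC %in% top_5)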

ggplot(hc_5, aes(BIAS_DESC, fill=BIAS_DESC)) + 
  geom_bar() + 
  facet_wrap(~DATA_YEAR) +
  coord_flip() +
  theme_minimal() +
  guides(fill='none')

answered Jan 07 '20 at 20:38

Kent Johnson

Thank you so much! I appreciate it. However, what if I chose to do a stacked barplot or wanted both 2009 and 2017 barplots in the same plot so it's easier to compare? – Louisse Bye Jan 07 '20 at 20:41
The original question aggregated occurrence of the phrase, e.g., 'Anti-White' across `BIAS_DESC`, e.g., `hate_crime %>% filter(grepl("Anti-White", BIAS_DESC)) %>% count(BIAS_DESC)` captures 22 different categories, so `top_5` (maybe obtained using `top_n()`?) isn't good enough... – Martin Morgan Jan 07 '20 at 20:50
Martin is right, more data munging may be needed to get the categories of interest. See `?geom_bar` for examples of stacked bar plots. To stack the plots as in your example, use `facet_wrap(~DATA_YEAR, ncol=1)`. – Kent Johnson Jan 07 '20 at 21:43
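
For the follow-up about showing both years in one plot, a minimal sketch with made-up data might look like the following. The toy rows are invented, `toy` and `p` are hypothetical names, and using `position = "dodge"` is one possible choice rather than something the answer prescribes.

```r
library(ggplot2)

# Toy data: invented rows, not the real data set
toy <- data.frame(
    DATA_YEAR = c(2009, 2009, 2009, 2017, 2017, 2017),
    BIAS_DESC = c("Anti-White", "Anti-Jewish", "Anti-White",
                  "Anti-Jewish", "Anti-Jewish", "Anti-White")
)

# factor() makes the year a discrete fill; position = "dodge" places the
# two years side by side, while the default ("stack") stacks them
p <- ggplot(toy, aes(BIAS_DESC, fill = factor(DATA_YEAR))) +
    geom_bar(position = "dodge") +
    coord_flip() +
    labs(fill = "Year") +
    theme_minimal()
p
```

Dropping the `position` argument gives the stacked version, since stacking is the default for `geom_bar()`.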

Martin Morgan · Accepted Answer · 2020-01-09T03:52:49.257

To aggregate across phrases as in the original question, I did

anti <- 
    hate_crime %>% 
    filter(DATA_YEAR %in% c("2009", "2017")) %>% 
    mutate(
        ANTI_WHITE = grepl("Anti-White", BIAS_DESC),
        ANTI_BLACK = grepl("Anti-Black", BIAS_DESC),
        ANTI_HISPANIC = grepl("Anti-Hispanic", BIAS_DESC)
    ) %>% 
    select(DATA_YEAR, starts_with("ANTI"))

I then created the counts of each occurrence with group_by() and summarize_all() (noting that the sum() of a logical vector is the number of TRUE occurrences), and used pivot_longer() to create a 'tidy' summary

anti %>% 
    group_by(DATA_YEAR) %>%
    summarize_all(~ sum(.)) %>%
    tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "COUNT")

The result is something like (there were errors importing the data with read_csv() that I did not investigate)

# A tibble: 6 x 3
  DATA_YEAR BIAS          COUNT
      <dbl> <chr>         <int>
1      2009 ANTI_WHITE      539
2      2009 ANTI_BLACK     2300
3      2009 ANTI_HISPANIC   486
4      2017 ANTI_WHITE      722
5      2017 ANTI_BLACK     2101
6      2017 ANTI_HISPANIC   444

Visualization seems like a second, separate, question.

The code can be made a little simpler by defining a function

n_with_bias <- function(x, bias)
    sum(grepl(bias, x))

and then avoiding the need to separately mutate the data

hate_crime %>%
    filter(DATA_YEAR %in% c("2009", "2017")) %>%
    group_by(DATA_YEAR) %>%
    summarize(
        ANTI_WHITE = n_with_bias(BIAS_DESC, "Anti-White"),
        ANTI_BLACK = n_with_bias(BIAS_DESC, "Anti-Black"),
        ANTI_HISPANIC = n_with_bias(BIAS_DESC, "Anti-Hispanic")
    ) %>%
    tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "N")
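
To see what this pipeline does without downloading the data, here is a small sketch on invented rows. The `toy` tibble, its values, and the `;` separator for compound biases are assumptions made for illustration only.

```r
library(dplyr)

# Toy data: invented rows, not the real data set
toy <- tibble(
    DATA_YEAR = c(2009, 2009, 2009, 2017, 2017),
    BIAS_DESC = c("Anti-White", "Anti-White;Anti-Jewish", "Anti-Jewish",
                  "Anti-Jewish", "Anti-White;Anti-Jewish")
)

n_with_bias <- function(x, bias)
    sum(grepl(bias, x))   # each TRUE counts as 1, each FALSE as 0

result <- toy %>%
    group_by(DATA_YEAR) %>%
    summarize(
        ANTI_WHITE = n_with_bias(BIAS_DESC, "Anti-White"),
        ANTI_JEWISH = n_with_bias(BIAS_DESC, "Anti-Jewish")
    ) %>%
    tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "N")
result
```

A compound row such as "Anti-White;Anti-Jewish" contributes to both counts, which is the aggregation across phrases that the question asked for.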

On the other hand, a base R approach might create vectors for years-of-interest and all biases (using strsplit() to isolate the components of the compound biases)

years <- c("2009", "2017")
biases <- unique(unlist(strsplit(hate_crime$BIAS_DESC, ";")))

then create vectors of biases in each year of interest

bias_by_year <- split(hate_crime$BIAS_DESC, hate_crime$DATA_YEAR)[years]

and iterate over each year and bias (nested iteration can be inefficient when there are many elements, e.g., tens of thousands, but that's not a concern here)

sapply(bias_by_year, function(bias) sapply(biases, n_with_bias, x = bias))

The result is a matrix with one row per bias and one column per year

                                                          2009 2017
Anti-Black or African American                            2300 2101
Anti-White                                                 539  722
Anti-Jewish                                                932  983
Anti-Arab                                                    0  106
Anti-Protestant                                             38   42
Anti-Other Religion                                        111   85
Anti-Islamic (Muslim)                                        0    0
Anti-Gay (Male)                                              0    0
Anti-Asian                                                 128  133
Anti-Catholic                                               52   72
Anti-Heterosexual                                           21   33
Anti-Hispanic or Latino                                    486  444
Anti-Other Race/Ethnicity/Ancestry                         296  280
Anti-Multiple Religions, Group                              48   52
Anti-Multiple Races, Group                                 180  202
Anti-Lesbian (Female)                                        0    0
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)    0    0
Anti-American Indian or Alaska Native                       68  244
Anti-Atheism/Agnosticism                                    10    6
Anti-Bisexual                                               24   24
Anti-Physical Disability                                    24   66
Anti-Mental Disability                                      70   89
Anti-Gender Non-Conforming                                   0   13
Anti-Female                                                  0   48
Anti-Transgender                                             0  117
Anti-Native Hawaiian or Other Pacific Islander               0   15
Anti-Male                                                    0   25
Anti-Jehovah's Witness                                       0    7
Anti-Mormon                                                  0   12
Anti-Buddhist                                                0   15
Anti-Sikh                                                    0   18
Anti-Other Christian                                         0   24
Anti-Hindu                                                   0   10
Anti-Eastern Orthodox (Russian, Greek, Other)                0    0
Unknown (offender's motivation not known)                    0    0
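
The base R steps above can be tried end to end on a toy data frame. The rows are invented (including a 2016 row to show that `[years]` drops other years), and `toy` and `counts` are hypothetical names.

```r
# Toy data: invented rows, not the real data set
toy <- data.frame(
    DATA_YEAR = c("2009", "2009", "2016", "2017", "2017"),
    BIAS_DESC = c("Anti-White", "Anti-White;Anti-Jewish", "Anti-White",
                  "Anti-Jewish", "Anti-White;Anti-Jewish"),
    stringsAsFactors = FALSE
)

n_with_bias <- function(x, bias)
    sum(grepl(bias, x))

years <- c("2009", "2017")
# Split compound descriptions on ";" and keep each distinct bias once
biases <- unique(unlist(strsplit(toy$BIAS_DESC, ";")))

# split() returns a list keyed by year; [years] keeps only 2009 and 2017
bias_by_year <- split(toy$BIAS_DESC, toy$DATA_YEAR)[years]

# Inner sapply(): one count per bias; outer sapply(): one column per year
counts <- sapply(bias_by_year, function(bias) sapply(biases, n_with_bias, x = bias))
counts
is.matrix(counts)
```

If a data frame is needed for later steps, `as.data.frame(counts)` converts the result while keeping the bias names as row names.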

This avoids the need to enter each bias in the summarize() step. I'm not sure how to do that computation in a readable tidy-style analysis.
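
One possible tidy-style sketch, offered as a suggestion rather than as part of the original answer, is to split compound descriptions into one row per bias and then count. It assumes compound biases are separated by `;` as in the `strsplit()` call above; the `toy` rows are invented.

```r
library(dplyr)
library(tidyr)

# Toy data: invented rows, not the real data set
toy <- tibble(
    DATA_YEAR = c(2009, 2009, 2017, 2017),
    BIAS_DESC = c("Anti-White", "Anti-White;Anti-Jewish",
                  "Anti-Jewish", "Anti-White;Anti-Jewish")
)

tidy_counts <- toy %>%
    # One row per (incident, bias): "A;B" becomes two rows
    separate_rows(BIAS_DESC, sep = ";") %>%
    count(DATA_YEAR, BIAS_DESC, name = "N") %>%
    # Biases absent in a year get an explicit 0 instead of a missing row
    complete(DATA_YEAR, BIAS_DESC, fill = list(N = 0L))
tidy_counts
```

This counts exact matches of each split component, whereas `grepl()` counts substring matches, so the two approaches could disagree if one bias name were contained inside another.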

Note that in the table above, any bias containing a ( has zeros in both years. This is because grepl() interprets its pattern as a regular expression, where ( and ) delimit a group instead of matching literal parentheses; fix this by adding fixed = TRUE

n_with_bias <- function(x, bias)
    sum(grepl(bias, x, fixed = TRUE))

and an updated result

                                                          2009 2017
Anti-Black or African American                            2300 2101
Anti-White                                                 539  722
Anti-Jewish                                                932  983
Anti-Arab                                                    0  106
Anti-Protestant                                             38   42
Anti-Other Religion                                        111   85
Anti-Islamic (Muslim)                                      107  284
Anti-Gay (Male)                                            688  692
Anti-Asian                                                 128  133
Anti-Catholic                                               52   72
Anti-Heterosexual                                           21   33
Anti-Hispanic or Latino                                    486  444
Anti-Other Race/Ethnicity/Ancestry                         296  280
Anti-Multiple Religions, Group                              48   52
Anti-Multiple Races, Group                                 180  202
Anti-Lesbian (Female)                                      186  133
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)  311  287
Anti-American Indian or Alaska Native                       68  244
Anti-Atheism/Agnosticism                                    10    6
Anti-Bisexual                                               24   24
Anti-Physical Disability                                    24   66
Anti-Mental Disability                                      70   89
Anti-Gender Non-Conforming                                   0   13
Anti-Female                                                  0   48
Anti-Transgender                                             0  117
Anti-Native Hawaiian or Other Pacific Islander               0   15
Anti-Male                                                    0   25
Anti-Jehovah's Witness                                       0    7
Anti-Mormon                                                  0   12
Anti-Buddhist                                                0   15
Anti-Sikh                                                    0   18
Anti-Other Christian                                         0   24
Anti-Hindu                                                   0   10
Anti-Eastern Orthodox (Russian, Greek, Other)                0   22
Unknown (offender's motivation not known)                    0    0
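
The effect of `fixed = TRUE` can be checked in isolation with a two-element vector. The vector `x` is made up for illustration, and the escaped pattern is shown only as an alternative, not as something the answer used.

```r
x <- c("Anti-Gay (Male)", "Anti-White")

# As a regular expression, "(Male)" is a group that matches the text
# "Male", so the pattern looks for "Anti-Gay Male" and finds nothing
regex_hits <- grepl("Anti-Gay (Male)", x)

# fixed = TRUE treats the pattern as a literal string
fixed_hits <- grepl("Anti-Gay (Male)", x, fixed = TRUE)

# Escaping the parentheses is an alternative that keeps regex matching
escaped_hits <- grepl("Anti-Gay \\(Male\\)", x)

regex_hits
fixed_hits
escaped_hits
```

Because the bias names here come straight from the data and are not meant as patterns, literal matching with `fixed = TRUE` is the simpler choice.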

This is more what I was looking for! Thank you!! :) – Louisse Bye Jan 07 '20 at 21:18