I have a dataset about a university's student body with 10 columns that represent different factors such as their student id, gender, ethnicity, etc.

For now I'm only interested in the term in which each student was admitted and their ethnicity, because I want to see how the number of students from different ethnic backgrounds has changed over time. So I created a new two-column data frame called ethnicitydf:

> head(ethnicitydf)
  admit_term                  ethn_desc
1 2011-10-01            White/Caucasian
2 2011-10-01 Filipino/Filipino-American
3 2011-10-01            White/Caucasian
4 2011-10-01       Latino/Other Spanish
5 2011-10-01      East Indian/Pakistani
6 2011-10-01            White/Caucasian

I'm not sure how to create a plot with admit_term (time) on the x-axis and the number of students of each ethnicity in each admit_term on the y-axis. There are 12 unique ethnicities in the second column, and I want to show the frequency of all 12 for each admit_term (6 terms in total) in one graph, with each ethnicity in a different color.

My first thought was to count each ethnicity for each term using, for example, `length(which(ethnicitydf$admit_term == "2011-10-01" & ethnicitydf$ethn_desc == "White/Caucasian"))`, and to record the results in a new data frame, but I feel there should be a faster and more efficient way of doing this. Maybe with a package? Could anybody help me out? Thank you!

John
  • Sounds like you want an aggregation function where you get a count for each `admit_term` + `ethn_desc` combination - see here for many options - https://stackoverflow.com/questions/9809166/count-number-of-rows-within-each-group – thelatemail Jul 03 '19 at 00:14

1 Answer

A bar plot will do the counts for you: geom_bar() counts the rows in each group by default, so you do not need to aggregate first.

library(ggplot2)

ethnicitydf <- data.frame(admit_term = sample(c("2011-10-01", "2012-10-01", "2013-10-01"), 100, TRUE),
                          ethn_desc = sample(c("White/Caucasian", "Filipino/Filipino-American", "East Indian/Pakistani"), 100, TRUE))

ggplot() +
    geom_bar(data=ethnicitydf, mapping=aes(x=admit_term, fill=ethn_desc), position="dodge")

Created on 2019-07-03 by the reprex package (v0.3.0)

You can also just plot points if you have a lot of series, like this.

ggplot() +
    geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")

To get lines you will need to make sure your x axis is continuous rather than text, so convert the text dates into Date objects first.

ethnicitydf$admit_term <- as.Date(ethnicitydf$admit_term)

ggplot() +
    geom_line(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count") +
    geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
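
If you would rather have the counts in a data frame first, as the question suggests, base R's table() can count every admit_term + ethn_desc combination in one call. This is only a minimal sketch: the sample data is made up in the same shape as ethnicitydf, and the count column name n is my own choice, not something from your data.

```r
# Hypothetical sample data in the same shape as the question's ethnicitydf
set.seed(1)
ethnicitydf <- data.frame(
  admit_term = sample(c("2011-10-01", "2012-10-01", "2013-10-01"), 100, TRUE),
  ethn_desc  = sample(c("White/Caucasian", "Filipino/Filipino-American",
                        "East Indian/Pakistani"), 100, TRUE)
)

# Count rows for every admit_term + ethn_desc combination.
# Combinations that never occur are kept with a count of zero.
counts <- as.data.frame(
  table(admit_term = ethnicitydf$admit_term,
        ethn_desc  = ethnicitydf$ethn_desc),
  responseName = "n",          # assumed name for the count column
  stringsAsFactors = FALSE
)

# Convert the term labels to Date so the x axis is continuous
counts$admit_term <- as.Date(counts$admit_term)

# Plot the precomputed counts, one coloured line per ethnicity
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = admit_term, y = n, colour = ethn_desc)) +
    geom_line() +
    geom_point()
  print(p)
}
```

Because the counts are computed up front, y is mapped to n directly and no stat="count" is needed. The counts data frame is also handy if you want to inspect or export the numbers.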

Simon Woodward