I have a data frame with 5 variables:
- Race
- Job
- City
- Year
- Frequency (integer)
Reproducible data:
set.seed(4)
Race <- rep(c("Black", "White"), times = 4)
Job <- rep(c("BrickLayer","Cleaner"), each = 4)
City <- rep(c("New York City","New York City", "New Haven", "New Haven"), times = 4)
Year <- rep(c("2018","2019"), each = 8)
Frequency <- sample(100:542, 4, replace = TRUE)  # 4 counts; data.frame() recycles them across all 16 rows
TestFrame <- data.frame(Race, Job, City, Year, Frequency)
View(TestFrame)
I want to know what share of each job each race makes up, per city, per year.
So I'd like a table like this:
White - Brick Layer - New York City - 2018 - 0.54
Black - Brick Layer - New York City - 2018 - 0.46
....
White - Brick Layer - New York City - 2019 - 0.47
Black - Brick Layer - New York City - 2019 - 0.53
I then want to plot this difference for every job, city, and year.
So I want to make one big table and one big plot showing how the composition of each job in each city changes over the years. I already have the ggplot code, but I'm not sure how to do the following efficiently:
- sum all Brick Layers in New York City in 2018, then
- work out how many of those are White and how many are Black
In my original data I'd have to do that about a thousand times if done by hand, and that's for only one table. I'm probably going to make about 5 of those.
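For reference, the two steps above could be sketched in base R roughly as follows. This is only a minimal sketch, assuming the data looks like TestFrame above; the column names Total and Share and the object Result are hypothetical names introduced here for illustration. It uses ave() to compute the group total on every row, so no manual looping over job/city/year combinations is needed.

```r
# Toy data built the same way as TestFrame in the question
set.seed(4)
Race <- rep(c("Black", "White"), times = 8)
Job <- rep(rep(c("BrickLayer", "Cleaner"), each = 4), times = 2)
City <- rep(c("New York City", "New York City", "New Haven", "New Haven"), times = 4)
Year <- rep(c("2018", "2019"), each = 8)
# 4 random counts repeated to fill 16 rows, mirroring the recycling in the question's data
Frequency <- rep(sample(100:542, 4, replace = TRUE), times = 4)
TestFrame <- data.frame(Race, Job, City, Year, Frequency)

# Step 1: total for each Job/City/Year group, repeated on every row of that group
# (Total is a hypothetical column name)
TestFrame$Total <- ave(TestFrame$Frequency,
                       TestFrame$Job, TestFrame$City, TestFrame$Year,
                       FUN = sum)

# Step 2: share of each race within its group (Share is a hypothetical column name)
TestFrame$Share <- TestFrame$Frequency / TestFrame$Total

# Sort so the rows read like the desired table
Result <- TestFrame[order(TestFrame$Year, TestFrame$City,
                          TestFrame$Job, TestFrame$Race), ]
print(Result)
```

Because the shares are stored as an ordinary column, Result can be passed straight to ggplot. The same idea can also be written with dplyr's group_by() and mutate() if that package is already in use.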