I have a data frame with 5 variables:

  • Race
  • Job
  • City
  • Year
  • Frequency (integer)

Reproducible data is:

set.seed(4)
Race <- rep(c("Black", "White"), times = 4)
Job <- rep(c("BrickLayer","Cleaner"), each = 4)
City <- rep(c("New York City","New York City", "New Haven", "New Haven"), times = 4)
Year <- rep(c("2018","2019"), each = 8)
# Draw one frequency per row (16 rows) as a plain integer vector
Frequency <- sample(100:542, 16, replace = TRUE)

TestFrame <- data.frame(Race, Job, City, Year, Frequency)
View(TestFrame)

I want to know what share of each job each race makes up, per city, per year.

So, I'd like a table saying that

White - Brick Layer - New York City - 2018 - 0,54

Black - Brick Layer - New York City - 2018 - 0,46

....

White - Brick Layer - New York City - 2019 - 0,47

Black - Brick Layer - New York City - 2019 - 0,53

I would then plot this difference for every job, city and year.

So, I want to make a big table and a big plot showing how the composition of each job/city changes over the years. I already have the ggplot code, but I'm not sure how to efficiently sum up:

all Brick Layers, in New York, in 2018

then

Get how many of those are white, how many of those are black

In my original data, I'd have to do that about a thousand times if done by hand, and that's for only one table. I'm probably going to make about 5 of those.

Lelleo
  • What have you tried so far? –  Aug 04 '22 at 05:14
  • I have no clue where to start. I could only think of doing it manually, but I'm sure there is a better way – Lelleo Aug 04 '22 at 05:17
  • Are you familiar with ggplot2 and dplyr? Your question seems like a great opportunity to get better at those two libraries, because it doesn't sound too difficult, it just takes some effort toying with the data –  Aug 04 '22 at 05:22
  • IMO you should spend some time using those two libraries and come back and ask when you hit a roadblock –  Aug 04 '22 at 05:25
  • The plot part is OK for me; it's creating the data to plot where I'm really clueless. Any tips on what functions I should use? – Lelleo Aug 04 '22 at 05:30
  • Your question calls for some grouping. The `group_by` function is the grouping function you'll need to use in dplyr. Try to keep the variables you want to plot in long format so your life is easier with ggplot2. Then you'll need to `summarise` or `mutate` to get the percentages that you'll use to plot –  Aug 04 '22 at 05:36
  • See here for a solution to one part: [Relative frequencies / proportions with dplyr](https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr) – socialscientist Aug 04 '22 at 16:52
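
Following the `group_by` suggestion in the comments, a minimal sketch of the share calculation might look like the code below. It assumes dplyr is installed. The `Share` column and the `Shares` object are names made up for illustration, and the example data is rebuilt here with one `Frequency` value per row, so the numbers may differ from the frame in the question.

```r
library(dplyr)

# Rebuild the example data: 16 rows, one per Race/Job/City/Year combination.
# Assumption: every row gets its own Frequency draw.
set.seed(4)
TestFrame <- data.frame(
  Race = rep(c("Black", "White"), times = 8),
  Job = rep(rep(c("BrickLayer", "Cleaner"), each = 4), times = 2),
  City = rep(c("New York City", "New York City", "New Haven", "New Haven"), times = 4),
  Year = rep(c("2018", "2019"), each = 8),
  Frequency = sample(100:542, 16, replace = TRUE)
)

# Group by everything except Race, then divide each row's Frequency
# by the total for its Job/City/Year group.
Shares <- TestFrame %>%
  group_by(Job, City, Year) %>%
  mutate(Share = Frequency / sum(Frequency)) %>%
  ungroup()

print(Shares)
```

Within each Job/City/Year group the `Share` values sum to 1. Using `mutate` instead of `summarise` keeps one row per race, so the result stays in the long format that ggplot2 works with. Because the grouping covers the whole frame at once, nothing has to be repeated by hand for each job, city or year.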

0 Answers