I'm trying to show the distribution of salaries for a particular occupation. The BLS data is reported by county. The code below almost gets me what I want, but the y axis counts rows, that is, counties, rather than employees.

So a county with 10 employees and an average income of 50k counts the same as a county with 100 employees and an average income of 80k. I know I could expand each county row by its number of employees, giving 10 rows of 50k and 100 rows of 80k, but I'm sure there is a better approach; I just can't find it.

    library(ggplot2)
    library(scales)  # provides label_comma()

    ggplot(Construction[which(Construction$avg_annual_pay > 0), ], aes(x = avg_annual_pay)) +
      geom_histogram(binwidth = 5000, colour = "black", fill = "white") +
      scale_x_continuous(labels = label_comma())

Example of the data:

    county  avg # employees  avg annual pay
    1       34               47000
    2       900              88000
    3       85               40000

I tried setting `y = avg_employees`, but `geom_histogram` doesn't allow both x and y to be supplied.
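For reference, here is a minimal sketch of how the row-expansion idea compares with ggplot2's `weight` aesthetic, which `geom_histogram` sums per bin instead of counting rows. It uses a toy data frame built from the three example rows above; the name `toy` and the column name `avg_employees` are assumptions for illustration, not the real BLS column names. With only county averages available, either version still bins each county at its average pay, so it cannot show the spread of individual salaries within a county.

```r
library(ggplot2)

# Toy data mirroring the example table (column names are assumed, not BLS names)
toy <- data.frame(
  county         = 1:3,
  avg_employees  = c(34, 900, 85),
  avg_annual_pay = c(47000, 88000, 40000)
)

# Weighted histogram: each county adds its employee count to its bin
p_weighted <- ggplot(toy, aes(x = avg_annual_pay, weight = avg_employees)) +
  geom_histogram(binwidth = 5000, colour = "black", fill = "white")

# Row expansion as described in the question: one row per employee
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$avg_employees), ]
p_expanded <- ggplot(expanded, aes(x = avg_annual_pay)) +
  geom_histogram(binwidth = 5000, colour = "black", fill = "white")

# Extract the computed bar heights so the two approaches can be compared
weighted_counts <- layer_data(p_weighted)$count
expanded_counts <- layer_data(p_expanded)$count
```

Comparing `weighted_counts` with `expanded_counts` is a quick way to check whether the two approaches agree on this toy data before trying the weighted version on `Construction`.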

Edit:

    qcewGetIndustryData <- function(year, qtr, industry) {
      url <- "http://data.bls.gov/cew/data/api/YEAR/QTR/industry/INDUSTRY.csv"
      url <- sub("YEAR", year, url, ignore.case = FALSE)
      url <- sub("QTR", tolower(qtr), url, ignore.case = FALSE)
      url <- sub("INDUSTRY", industry, url, ignore.case = FALSE)
      read.csv(url, header = TRUE, sep = ",", quote = "\"", dec = ".", na.strings = " ", skip = 0)
    }

    Construction <- qcewGetIndustryData("2015", "a", "1012")

Edit2:

> head(Construction[,1:5])
  area_fips own_code industry_code agglvl_code size_code
1     01000        3          1012          53         0
2     01000        5          1012          53         0
3     01001        5          1012          73         0
4     01003        5          1012          73         0
5     01005        5          1012          73         0
6     01007        5          1012          73         0
  • Can you provide `dput(Construction)` and any code you've tried out so far, even if you think it's clunky? Is the table you provided similar to the data you currently have or is the table indicative of what you want your final output to look like? – jrcalabrese Feb 15 '23 at 18:12
  • I think you just want a weighted histogram: https://stackoverflow.com/questions/19841204/create-a-histogram-for-weighted-values – MrFlick Feb 15 '23 at 18:23
  • @MrFlick yes and no. I had the same thought at first, but applying weights would still plot the distribution of the per-county averages, when I want the distribution of salaries across the population as a whole. It might work if I found a way to weight each county by its share of the total employee population. But that sounds like overcomplicating the problem, and I am not 100% confident the outcome would truly represent what I want. Thank you for the link, though. – creetz Feb 16 '23 at 23:56
  • @jrcalabrese I added the code that creates the Construction data set. I'm not familiar with `dput`, and its output for Construction is so long that it runs past what fits in the console window, so I'm not sure how to share it with you. The table I provided is an example of the current dataset. – creetz Feb 17 '23 at 10:03
