I have a dataset of jobs for individuals along with some information on salaries for certain occupations, and I am trying to create a subset that standardizes job names through fuzzy matching. Specifically, a job title called "Cost Accountant" with monthly wage of $4000 and "Financial Accountant" with $5000 would be matched under a new column called "Accountant" that computes the average of the jobs with similar names.
Here is my code thus far: #upload packages
library(stringr)
library(dplyr)
# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])
output:
structure(list(jobtitle = c("PE Teacher", "Accountant",
"Dewatering Supervisor", "sales account manager", "Sales Lead",
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer",
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant",
"Calls Admin", "Financial Accountant", "Sales Representative",
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer",
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA,
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
I have followed the instructions here which gave me a good start because it flags other rows/observations that have been matched, but I am not able to standardize job titles as I explained in the example earlier.
# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])
output:
structure(list(jobtitle = c("PE Teacher", "Accountant",
"Dewatering Supervisor", "sales account manager", "Sales Lead",
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer",
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA,
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L,
1111L), `Accountant` = 2L,
`Dewatering Supervisor` = 3L, `sales account manager` = c(4L,
1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L),
`Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L,
196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L,
71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L,
1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L),
`Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
The current df looks as follows:
jobtitle avg_wage
Financial Accountant $5000
Cost Accountant $4000
Retail Accountant $4000
The desired outcome is as follows, where the average wage is based on a mean of all accounting wages and instead of "cost accountant" or "financial accountant", all accounting jobs would be something like "Accountant"
jobtitle avg_wage
Accountant $4333