0

I have a dataset of jobs for individuals along with some information on salaries for certain occupations, and I am trying to create a subset that standardizes job names through fuzzy matching. Specifically, a job title called "Cost Accountant" with monthly wage of $4000 and "Financial Accountant" with $5000 would be matched under a new column called "Accountant" that computes the average of the jobs with similar names.

Here is my code thus far: #upload packages

library(stringr)
library(dplyr)
# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])

output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

I have followed the instructions here which gave me a good start because it flags other rows/observations that have been matched, but I am not able to standardize job titles as I explained in the example earlier.

# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])

output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

The current df looks as follows:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

The desired outcome is as follows, where the average wage is based on a mean of all accounting wages and instead of "cost accountant" or "financial accountant", all accounting jobs would be something like "Accountant"

jobtitle       avg_wage
Accountant     $4333  
maldini1990
  • 279
  • 2
  • 11
  • 1
    hi nesta! Can you show me what your desired output is? – Mark Aug 11 '23 at 03:50
  • 1
    thanks for that! another thing- neither of the dataframes you included have 'a job title called "Cost Accountant" with monthly wage of $4000' nor a '"Financial Accountant" with $5000' – Mark Aug 11 '23 at 09:39
  • Correct @Mark, I just gave that as an example of similar sounding jobs in my dataset, hope that clarifies my desired output. – maldini1990 Aug 11 '23 at 11:26

1 Answers1

1

I think this is what you want? Though I'm not entirely sure:

library(tidyverse)

# the same as the smallest example dataframe you gave, with an extra irrelevant row for demonstration
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

# same with this
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# basically what's happening here is we're looking for the job group in each job title, removing NA values, then if there's no job group in the title, we're returning NA, else returning the job title(s)
mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

Output:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>
Mark
  • 7,785
  • 2
  • 14
  • 34