How to find the same job titles that are written differently?

Question

I want to find people who have the job title (for example) Market Research Coordinator on their resume, but they may have written it differently, for example:

Marketing Research Coordinator
Market Researching Coordinator
Markets Research Coordinator
Market Researches Coordinator
Marketing Research Coordinator
Markets Researchers Coordinator
Market Researcher Coordinators
Marketing Researcher Coordinators
...

Matching with == does not give good results, and stemming and lemmatization also have difficulty identifying such differences.
Another option is to use a similarity metric between two strings (which is discussed in this question), but that would be very time-consuming and is probably not a good method. Choosing the threshold is a further problem with that approach.
Does anyone have a better idea?

Yasi Klingler · Answer 1 · 2020-11-05T14:34:47.883

1

I do not agree that stemming and lemmatization do not work! You can tokenize your inputs and then stem each token. In the case of Marketing, provided the language is set correctly in your stemming package, you will get market. Also make sure that you apply the stemming to both sides of the comparison in your if statement!

If there are spelling errors or small differences, you can use a Levenshtein package and accept the inputs whose similarity ratio is above a threshold T.

Example:

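from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
# Stemming a single token works as expected.
print("the stem of marketing:", p_stemmer.stem('Marketing'))
# Passing a whole phrase treats it as one word, so nothing is stemmed.
print("the stem of marketing research:", p_stemmer.stem('Marketing Research'))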

The results will be:

the stem of marketing: 'market' (correct)

the stem of marketing research: 'marketing research' (not what we want)

As you can see, if the tokenization is not applied, the stemmer does not work as expected.
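A minimal sketch of per-token stemming, assuming whitespace tokenization is enough for job titles. The `crude_stem` function is a hypothetical stand-in so that the snippet runs without dependencies; it is not a real stemming algorithm. In practice you would pass `PorterStemmer().stem` from NLTK as the `stem` argument instead.

```python
def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    word = word.lower()
    for suffix in ("ers", "er", "ing", "es", "s"):
        # Keep at least four characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize_title(title, stem=crude_stem):
    """Tokenize on whitespace, stem every token, and join the stems again."""
    return " ".join(stem(token) for token in title.split())


target = normalize_title("Market Research Coordinator")
candidates = [
    "Marketing Research Coordinator",
    "Markets Researchers Coordinator",
    "Market Researcher Coordinators",
    "Senior Advertising Analyst",
]
for title in candidates:
    # Both sides of the comparison are stemmed before they are compared.
    print(title, "->", normalize_title(title) == target)
```

Because the reference title and each candidate go through the same `normalize_title` call, it does not matter that a stemmer produces non-words, as long as it produces the same non-word on both sides.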

I would suggest combining all of these (tokenization, stemming, and Levenshtein).
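For the Levenshtein part, here is a rough standard-library sketch rather than a specific package. The normalization (one minus the distance divided by the longer length) and the threshold value `T = 0.8` are assumptions for illustration; a Levenshtein package may define its ratio differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


T = 0.8  # hypothetical threshold; tune it on your own data
target = "Market Research Coordinator"
for title in ["Markte Research Coordinator", "Senior Advertising Analyst"]:
    score = similarity(target, title)
    print(f"{title}: {score:.3f} -> {'match' if score >= T else 'no match'}")
```

Running the stemming step first and this check second means the threshold only has to absorb typos, not regular suffix variation.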

edited Nov 05 '20 at 14:34

answered Nov 05 '20 at 13:49

Yasi Klingler


I do not use a stemmer because it makes errors; for example, it changes `verify` to `verifi`. But as you said, if I apply stemming to both sides, the problem may be solved. Thanks! Meanwhile, `p_stemmer.stem('Marketing Research')` does not work, because the stemmer treats the entire sentence as one word and returns it as is. We have to stem each word in the sentence and return the combined sentence. – Meysam Nov 05 '20 at 14:17
Yes, I wrote that example to show that passing the string as a whole does not work. :) So you should apply it word by word. Yes, applying it to both sides gives verifi on both, and then you can compare them. You're welcome. :) Please upvote the answer if it helped. Thanks – Yasi Klingler Nov 05 '20 at 14:28
Oops, I missed this part of your text: "if the tokenization is not applied". – Meysam Nov 05 '20 at 14:34

score 1 · Answer 2 · answered Nov 05 '20 at 13:53

You can use the Python package textdistance to calculate the normalized similarity between strings, and only keep them if the similarity is higher than a certain threshold.

import textdistance

main_job = 'Marketing Research Coordinator'

other_jobs = ['Market Researching Coordinator', 'Markets Research Coordinator',
              'Market Researches Coordinator', 'Marketing Research Coordinator',
              'Markets Researchers Coordinator', 'Market Researcher Coordinators',
              'Marketing Researcher Coordinators', 'Marketing Researcher Executive',
              'Senior Advertising Analyst']

# Compare every candidate title against the reference title.
for job in other_jobs:
    # The value is a similarity score (higher means more alike), not a distance.
    similarity = textdistance.jaccard.normalized_similarity(main_job, job)
    print(f'Similarity "{main_job}" & "{job}": {similarity:.3f}')

Similarity "Marketing Research Coordinator" & "Market Researching Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Research Coordinator": 0.871
Similarity "Marketing Research Coordinator" & "Market Researches Coordinator": 0.844
Similarity "Marketing Research Coordinator" & "Marketing Research Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Researchers Coordinator": 0.794
Similarity "Marketing Research Coordinator" & "Market Researcher Coordinators": 0.818
Similarity "Marketing Research Coordinator" & "Marketing Researcher Coordinators": 0.909
Similarity "Marketing Research Coordinator" & "Marketing Researcher Executive": 0.579
Similarity "Marketing Research Coordinator" & "Senior Advertising Analyst": 0.436

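Take a look at the last two examples: the titles that describe different jobs score much lower (0.579 and 0.436) than the genuine variants (0.794 and above), so a threshold between those values separates them.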
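The 1.000 for "Market Researching Coordinator" is worth understanding before picking a threshold. The scores above are consistent with a Jaccard index computed over character multisets, which ignores character order, so two titles built from the same letters look identical. The following standard-library sketch reproduces that calculation; it is an assumption about the metric's default behaviour, not textdistance's actual code, and `THRESHOLD` is a hypothetical cut-off.

```python
from collections import Counter


def char_jaccard(a, b):
    """Jaccard similarity over character multisets (order is ignored)."""
    counts_a, counts_b = Counter(a), Counter(b)
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 1.0


THRESHOLD = 0.75  # hypothetical cut-off, chosen only for this illustration
main_job = "Marketing Research Coordinator"
other_jobs = [
    "Market Researching Coordinator",
    "Markets Research Coordinator",
    "Marketing Researcher Executive",
    "Senior Advertising Analyst",
]

# Keep only the titles whose similarity reaches the threshold.
matches = [job for job in other_jobs if char_jaccard(main_job, job) >= THRESHOLD]
print(matches)
```

If order-insensitivity is a concern for your data, comparing word tokens instead of characters is one alternative to consider.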

Mahesh Anakali · Answer 3 · 2020-11-05T13:52:06.907

0

Use the regex pattern below and check whether the job title matches:

import re
pattern = r'Market(\w*?) Research(\w*?) Coordinator'
print('Enter job title')
job_title = input()
if re.search(pattern, job_title):
    print('Job title matching with Market Research Coordinator')
else:
    print('Job title not matching with Market Research Coordinator')
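A non-interactive variant that checks a list of titles might look like the sketch below. The word boundaries, the optional suffix on Coordinator, and the case-insensitive flag are assumed refinements, not part of the pattern above; they are there so that, for example, "Supermarket Research Coordinator" is not accepted.

```python
import re

# Assumed refinements: word boundaries, optional suffixes, case-insensitive matching.
pattern = re.compile(
    r"\bMarket\w*\s+Research\w*\s+Coordinator\w*\b",
    re.IGNORECASE,
)

titles = [
    "Marketing Research Coordinator",
    "Markets Researchers Coordinator",
    "Market Researcher Coordinators",
    "Marketing Researcher Executive",
    "Senior Advertising Analyst",
]

for title in titles:
    # search() finds the pattern anywhere in the string, e.g. inside a resume line.
    print(title, "->", bool(pattern.search(title)))
```

A pattern like this has to be written by hand for every job title, so it suits a short, fixed list of titles better than an open-ended search.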

edited Nov 05 '20 at 13:52

answered Nov 05 '20 at 13:45

Mahesh Anakali

