Search column for a string and Classify them with a dictionary keys

Question

I've imported a spreadsheet I'd exported from Linkedin of my connections and want to categorize people's positions at different levels.

So, I've created a dictionary with the terms to look up each position level.

A first version of the dictionary would be:

dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'], 
                '2 - Manager': ['Manager', 'Administrador'], 
                '3 - Engenheiro': ['Engenheiro', 'Engineering'], 
                '4 - Consultor': ['Consultor', 'Consultant'], 
                '5 - Estagiário': ['Estagiário', 'Intern'], 
                '6 - Desempregado': ['Self-Employed', 'Autônomo'], 
                '7 - Professor': ['Professor', 'Researcher'] }

And I need a code to read each position in my spreadsheet, check if there is any of these terms and return the equivalent key in another specific column.

A sample data of the dataframe i'm reading would be:

sample = pd.Series(data = (['(blank)'], ['Estagiário'], ['Professor', 'Adjunto'], 
                           ['CEO', 'and', 'Founder'], ['Engenheiro', 'de', 'Produção'], 
                           ['Consultant'], ['Founder', 'and', 'CTO'], 
                           ['Intern'], ['Manager', 'Specialist'], 
                           ['Administrador', 'de', 'Novos', 'Negócios'], 
                           ['Administrador', 'de', 'Serviços']))

Which Returns:

0                                [(blank)]
1                             [Estagiário]
2                     [Professor, Adjunto]
3                      [CEO, and, Founder]
4               [Engenheiro, de, Produção]
5                             [Consultant]
6                      [Founder, and, CTO]
7                                 [Intern]
8                    [Manager, Specialist]
9     [Administrador, de, Novos, Negócios]
10           [Administrador, de, Serviços]
dtype: object

I've done the following code:

import pandas as pd
plan = pd.read_excel('SpreadSheet Name.xlsx', sheet_name = 'Positions')

list0 = ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner']
list1 = ['Director', 'Head']
list2 = ['Manager', 'Administrador']   
listgeral = [dic0, dic1, dic2]

def in_list(list_to_search,terms_to_search):     
    results = [item for item in list_to_search if item in terms_to_search]
    if len(results) > 0:
        return '0 - CEO, Founder'        
    else:
        pass
plan['PositionLevel'] = plan['Position'].str.split().apply(lambda x: in_list(x, listgeral[0]))

Actual output:

                                          Position           PositionLevel
0                                        '(blank)'                None
1                                     'Estagiário'                None
2                              'Professor Adjunto'                None
3                                'CEO and Founder'         '0 - CEO, Founder'
4                         'Engenheiro de produção'                None
5                                     'Consultant'                None
6                                'Founder and CTO'         '0 - CEO, Founder'
7                                         'Intern'                None
8                             'Manager Specialist'                None
9                'Administrador de Novos Negócios'                None

Expected output:

                                            Position         PositionLevel
0                                          '(blank)'              None
1                                       'Estagiário'       '5 - Estagiário'
2                                'Professor Adjunto'       '7 - Professor'
3                                  'CEO and Founder'      '0 - CEO, Founder'
4                           'Engenheiro de produção'       '3 - Engenheiro'
5                                       'Consultant'       '4 - Consultor'
6                                  'Founder and CTO'      '0 - CEO, Founder'
7                                           'Intern'       '5 - Estagiário'
8                               'Manager Specialist'        '2 - Manager'
9                  'Administrador de Novos Negócios'        '2 - Manager'

First I was planning to run that code for every list inside my listgeral, but I coundn't do so. Then I started to believe it would be better to apply this one for a big dictionary, just as the dicpositions from the beginning of the question, and the return the key of the term.

I've tried to apply the following code to this program:

dictest = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'], 
           '1 - Director of': ['Director', 'Head'], 
           '2 - Manager': ['Manager', 'Administrador']}

def in_dic (x, dictest):
    for key in dictest:
        for elem in dictest[key]:
            if elem == x:
                return key
    return False

Where the output from in_dic('CEO', dictest) Is '0 - CEO, Founder'

And, for example, the output from in_dic('Banana', dictest) is False

But I couldn't advance from it and apply this function in_dic() to my problem.

I would really appreciate anybody's help.

Thanks a lot.

score 0 · Accepted Answer · answered Mar 06 '20 at 15:11

I took the liberty to refactor your input a bit, but here's what I've got (it might be slightly over-engineered). In short, we use a library called jellyfish (pip3 install jellyfish, code taken from this answer) to do fuzzy string-matching to match positions in your excel sheet with positions in your dicpositions, then just map those to the category in the same dict. Here's the imports and the matching function:

import pandas as pd
import numpy as np
import jellyfish


# Function for fuzzy-matching strings
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0

    # Keep an eye out for "blank" values, they can be strings, e.g. "(blank)", or e.g. NaN values
    no_values = ["(blank)", np.nan, None]
    if x in no_values:
        return "(blank)"

    # Find which string most closely matches our input and return it
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)

        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string

    return best_match

Alright, and here's your dicpositions, which I translate to a long-format DataFrame for convenience:

# Translations between keywords and their category, as dict, as provided in question
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'],
                '2 - Manager': ['Manager', 'Administrador'],
                '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                '4 - Consultor': ['Consultor', 'Consultant'],
                '5 - Estagiário': ['Estagiário', 'Intern'],
                '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                '7 - Professor': ['Professor', 'Researcher'],
                'Not found"': ["(blank)"]  # <-- I added this to deal with blank values
}

# Let's expand the dict above to a DF, which makes for easier merging later
positions = []
aliases = []
for key, val in dicpositions.items():
    for v in val:
        positions.append(key)
        aliases.append(v)
# This will serve as our mapping table
lookup_table = pd.DataFrame({
    "position": positions,
    "alias": aliases
})
print(lookup_table)

Instead of being a dictionary, it's a long-format DataFrame. This format makes it really easy to match categories with various keywords later:

            position          alias
0   0 - CEO, Founder            CEO
1   0 - CEO, Founder        Founder
2   0 - CEO, Founder     Co-Founder
3   0 - CEO, Founder      Cofounder
4   0 - CEO, Founder          Owner
5    1 - Director of       Director
6    1 - Director of           Head
7        2 - Manager        Manager
8        2 - Manager  Administrador
9     3 - Engenheiro     Engenheiro
10    3 - Engenheiro    Engineering
11     4 - Consultor      Consultor
12     4 - Consultor     Consultant
13    5 - Estagiário     Estagiário
14    5 - Estagiário         Intern
15  6 - Desempregado  Self-Employed
16  6 - Desempregado       Autônomo
17     7 - Professor      Professor
18     7 - Professor     Researcher
19        Not found"        (blank)

Let's test some input and see how the matching works. We check every string in your input with the strings in the alias column, and return whichever value in the alias column most closely match our input data (we'll use this again later, to find the category, or position):

# Test input, as a list, you might have to wrangle it from your format to a list, though
test_df = pd.DataFrame({"test_position": ["(blank)", 'Estagiário', 'Professor Adjunto', 'CEO and Founder', 'Engenheiro de produção', 'Consultant', 'Founder and CTO', 'Intern', 'Manager Specialist', 'Administrador de Novos Negócios']})

# Match our test input with our mapping table, create a new column 'best_match' representing the value in our mapping table that most closely matches our input
test_df["best_match"] = test_df.test_position.map(lambda x: get_closest_match(x, lookup_table.alias))
print(test_df)

A new column is added to our test_df, indicating which alias in our lookup table is most similar to our test_position input:

                    test_position     best_match
0                          (blank)        (blank)
1                       Estagiário     Estagiário
2                Professor Adjunto      Professor
3                  CEO and Founder            CEO
4           Engenheiro de produção     Engenheiro
5                       Consultant     Consultant
6                  Founder and CTO        Founder
7                           Intern         Intern
8               Manager Specialist        Manager
9  Administrador de Novos Negócios  Administrador

To end up with the category, we simply merge our best_match column in our testing data with the alias column of our lookup table:

result = test_df.merge(lookup_table, left_on="best_match", right_on="alias", how="left")

Resulting in this:

                    test_position     best_match          alias          position
0                          (blank)        (blank)        (blank)         Not found
1                       Estagiário     Estagiário     Estagiário    5 - Estagiário
2                Professor Adjunto      Professor      Professor     7 - Professor
3                  CEO and Founder            CEO            CEO  0 - CEO, Founder
4           Engenheiro de produção     Engenheiro     Engenheiro    3 - Engenheiro
5                       Consultant     Consultant     Consultant     4 - Consultor
6                  Founder and CTO        Founder        Founder  0 - CEO, Founder
7                           Intern         Intern         Intern    5 - Estagiário
8               Manager Specialist        Manager        Manager       2 - Manager
9  Administrador de Novos Negócios  Administrador  Administrador       2 - Manager

Plasma, Thank you so much! It worked pretty well to my spreadsheet! I still have to complete better the ```dicpositions```, and then do more tests. But for now it's working perfectly, and it's quite scalable!! I'm really thankful for your help! — Daniel Flórido Leal, Mar 07 '20 at 00:11

Search column for a string and Classify them with a dictionary keys

1 Answers1