Simplifying a list into categories

Question

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, surgical, surgery listed. It has eyewear, eyeglasses and optometry listed. So instead of having a huge list in this column, i want to simply the category to say that if you find a word that contains "eye," "glasses" or "opto" then just change it to "eyewear." My initial code looks like this:

def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
         return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'

df['SIC_Desc'] = df.apply(map_company,axis=1)

This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel? Thank you

Why not just use a debugger? Debuggers are your friend, take the opportunity to learn to use one! — donkopotamus, Sep 15 '17 at 02:42
Sounds like you have your condition backwards. You want `'Dental' in company`. You can use `any()` to get the desired effect of comparing multiple values, e.g.: `elif any(i in company for i in ['Dental', 'Denture']): ...` — AChampion, Sep 15 '17 at 02:42
Hello Champion can you expand on your code? It doesn't seem to work. This is what I have: def map_company(row): company = row['SIC_Desc'] if any(i in company for i in ['Surgical', 'Surgery']): return 'Surgical' elif any(i in company for i in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']): return 'Eyewear' elif any(i in company for i in ['Cotton', 'Bandages', 'gauze', 'tape']): return 'First Aid' else: return 'Other' df['SIC_Desc'] = df.apply(map_company,axis=1) — afzaaldeveloper1, Sep 15 '17 at 03:06
i would suggest using pandas library for dataset operation which will help you to manipulate the data easily. i you like this i could give you examples — Midhun Mohan, Sep 15 '17 at 04:37

Flynsee · Answer 1 · 2017-09-15T04:55:41.247

It is hard to answer without having the exact content of your data set, but I can see one mistake. According to your description, it seems you are looking at this the wrong way. You want one of the words to be in your company description, so it should look like that:

if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers'])

However you might have a case issue here so I would recommend:

company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'

You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.

In the end your function will look like that:

def is_match(company,names):
    return any(name in company for name in names)

def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
         return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'

@Chinny84 Oh yes indeed! Thanks for noticing that.Edit: I fixed it — Flynsee, Sep 15 '17 at 04:55
so would you finish the code by writing df['SIC_Desc'] = df.apply(map_company,axis=1)? I have tried this and its still listing all of the rows as "Other" — afzaaldeveloper1, Sep 15 '17 at 19:30

pylang · Answer 2 · 2017-09-15T05:48:10.160

Here is an option using a reversed dictionary.

Code

import pandas as pd


# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles", 
               "glasses",  "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df

LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}

REVERSE_LOOKUP = {v:k for k, lst in LOOKUP.items() for v in lst}

def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")


df["SIC_Desc"] = df.apply(map_company, axis=1)
df

Details

We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:

>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
 'cotton': 'First Aid',
 'eye': 'Eyewear',
 'gauze': 'First Aid',
 ...}

Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.

See @Flynsee's insightful answer for an explanation of what is happening in your code. The code is cleaner compared a bevy of conditional statements.

Benefits

Since we have used dictionaries, the search time should be relatively fast, O(1) compared to a O(n) complexity using in. Moreover, the main LOOKUP dictionary is adaptable and liberated from manually implementing extensive conditional statements for new entries.

Simplifying a list into categories

2 Answers2