1

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, surgical, surgery listed. It has eyewear, eyeglasses and optometry listed. So instead of having a huge list in this column, i want to simply the category to say that if you find a word that contains "eye," "glasses" or "opto" then just change it to "eyewear." My initial code looks like this:

def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
         return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'

df['SIC_Desc'] = df.apply(map_company,axis=1)

This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel? Thank you

AChampion
  • 29,683
  • 4
  • 59
  • 75
  • Have verified the value(s) that are going into `company`? – Scott Hunter Sep 15 '17 at 02:41
  • Why not just use a debugger? Debuggers are your friend, take the opportunity to learn to use one! – donkopotamus Sep 15 '17 at 02:42
  • Can you also post the dataset that you are using? – WhatsYourIdea Sep 15 '17 at 02:42
  • 1
    Sounds like you have your condition backwards. You want `'Dental' in company`. You can use `any()` to get the desired effect of comparing multiple values, e.g.: `elif any(i in company for i in ['Dental', 'Denture']): ...` – AChampion Sep 15 '17 at 02:42
  • Hello Champion can you expand on your code? It doesn't seem to work. This is what I have: def map_company(row): company = row['SIC_Desc'] if any(i in company for i in ['Surgical', 'Surgery']): return 'Surgical' elif any(i in company for i in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']): return 'Eyewear' elif any(i in company for i in ['Cotton', 'Bandages', 'gauze', 'tape']): return 'First Aid' else: return 'Other' df['SIC_Desc'] = df.apply(map_company,axis=1) – afzaaldeveloper1 Sep 15 '17 at 03:06
  • i would suggest using pandas library for dataset operation which will help you to manipulate the data easily. i you like this i could give you examples – Midhun Mohan Sep 15 '17 at 04:37
  • Yes please give me examples – afzaaldeveloper1 Sep 15 '17 at 04:40

2 Answers2

4

It is hard to answer without having the exact content of your data set, but I can see one mistake. According to your description, it seems you are looking at this the wrong way. You want one of the words to be in your company description, so it should look like that:

if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers'])

However you might have a case issue here so I would recommend:

company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'

You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.

In the end your function will look like that:

def is_match(company,names):
    return any(name in company for name in names)

def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
         return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'
Flynsee
  • 596
  • 4
  • 17
0

Here is an option using a reversed dictionary.

Code

import pandas as pd


# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles", 
               "glasses",  "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df

enter image description here

LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}

REVERSE_LOOKUP = {v:k for k, lst in LOOKUP.items() for v in lst}

def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")


df["SIC_Desc"] = df.apply(map_company, axis=1)
df

enter image description here


Details

We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:

>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
 'cotton': 'First Aid',
 'eye': 'Eyewear',
 'gauze': 'First Aid',
 ...}

Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.

See @Flynsee's insightful answer for an explanation of what is happening in your code. The code is cleaner compared a bevy of conditional statements.

Benefits

Since we have used dictionaries, the search time should be relatively fast, O(1) compared to a O(n) complexity using in. Moreover, the main LOOKUP dictionary is adaptable and liberated from manually implementing extensive conditional statements for new entries.

pylang
  • 40,867
  • 14
  • 129
  • 121