8

Beginner with python - I'm looking to create a dictionary mapping of strings, and the associated value. I have a dataframe and would like create a new column where if the string matches, it tags the column as x.

df = pd.DataFrame({'comp':['dell notebook', 'dell notebook S3', 'dell notepad', 'apple ipad', 'apple ipad2', 'acer chromebook', 'acer chromebookx', 'mac air', 'mac pro', 'lenovo x4'],
              'price':range(10)})

For Example I would like to take the above df and create a new column df['company'] and set it to a mapping of strings.

I was thinking of doing something like

product_map = {'dell':'Dell Inc.',
               'apple':'Apple Inc.',
               'acer': 'Acer Inc.',
               'mac': 'Apple Inc.',
               'lenovo': 'Dell Inc.'}

Then I wanted to iterate through it to check the df.comp column and see if each entry contained one of those strings, and to set the df.company column to the value in the dictionary.

Not sure how to do this correctly though.

martineau
  • 119,623
  • 25
  • 170
  • 301
Matt W.
  • 3,692
  • 2
  • 23
  • 46
  • Similar: https://stackoverflow.com/questions/48510405/pandas-python-datafame-update-a-column/48510563#48510563 – pault Feb 02 '18 at 20:42
  • @pault I'll update my example as your workaround isn't the issue I'm trying to solve. thank you for the suggestion though. – Matt W. Feb 02 '18 at 20:45
  • so 'dell notebook' should be replaced by 'Dell Inc.' or 'Dell Inc. notebook'? – Vaishali Feb 02 '18 at 20:47
  • Not should be replaced, the new column `df['Company']` should be populated with `Dell Inc.`, because the `key` is within the string `dell notebook` – Matt W. Feb 02 '18 at 20:48
  • Doesn't @aquil.abdullah's solution below achieve your goal? – pault Feb 02 '18 at 20:49
  • Almost.. I was able to edit it to be what I want. was just checking it – Matt W. Feb 02 '18 at 20:52

4 Answers4

10

There are many ways to do this. One way to do it would be the following:

def like_function(x):
    group = "unknown"
    for key in product_map:
        if key in x:
            group = product_map[key]
            break
    return group

df['company'] = df.comp.apply(like_function)
aquil.abdullah
  • 3,059
  • 3
  • 21
  • 40
  • This is close but not fully correct, as I'm looking for the `value` in the `key, value` pair to be the outputted result. – Matt W. Feb 03 '18 at 02:02
  • You probably figured it out, but I mad the change to return the value rather than the key of the product_map. – aquil.abdullah Feb 05 '18 at 04:26
5

Here is an interesting way, especially if you are learning about python. You can subclass dict and override __getitem__ to look for partial strings.

class dict_partial(dict):
    def __getitem__(self, value):
        for k in self.keys():
            if k in value:
                return self.get(k)
        else:
            return self.get(None)

product_map = dict_partial({'dell':'Dell Inc.', 'apple':'Apple Inc.',
                            'acer': 'Acer Inc.', 'mac': 'Apple Inc.',
                            'lenovo': 'Dell Inc.'})

df['company'] = df['comp'].apply(lambda x: product_map[x])

               comp  price     company
# 0     dell notebook      0   Dell Inc.
# 1  dell notebook S3      1   Dell Inc.
# 2      dell notepad      2   Dell Inc.
# 3        apple ipad      3  Apple Inc.
# 4       apple ipad2      4  Apple Inc.
# 5   acer chromebook      5   Acer Inc.
# 6  acer chromebookx      6   Acer Inc.
# 7           mac air      7  Apple Inc.
# 8           mac pro      8  Apple Inc.
# 9         lenovo x4      9   Dell Inc.

My only annoyance with this method is that subclassing dict does't override dict.get at the same time as [] syntax. If this were possible, we could get rid of the lambda and use df['comp'].map(product_map.get). There doesn't seem to be an obvious solution to this.

jpp
  • 159,742
  • 34
  • 281
  • 339
  • I think you're missing an element in your `dict` - `'mac': 'Apple Inc.'`. – pault Feb 02 '18 at 21:45
  • 1
    @pault, fixed again (i think). the beauty of it is that class instances are easy to create and reuse. it's just unfortunate you can't use it fully, i.e. `dict.get` doesn't work. – jpp Feb 02 '18 at 21:53
  • Yes, now that I'm exclusively using python one of the few things I miss from my C++ days is function overloading. – pault Feb 02 '18 at 22:02
4

To my knowledge, pandas does not come with a "substring mapping" method. The .map() method does not support substrings, and the .str.contains() method only works with regular expressions (which does not scale well).

You can achieve the result you are after by writing a simple function. You can then use .apply() with a lambda function to generate your desired 'company' column. The added benefits are that it keeps your code readable and you can reuse the functionality. Hope that helps.

This should give you the 'company' column you need:

def map_substring(s, dict_map):
    for key in dict_map.keys():
        if key in s: 
            return dict_map[key]
    return np.nan

df['company'] = df['product'].apply(lambda x: map_substring(x, product_map))
fpersyn
  • 1,045
  • 1
  • 12
  • 19
2

A vectorized solution inspired by MaxU's solution to a similar problem.

x = df.comp.str.split(expand=True)
df['company'] = None
df['company'] = df['company'].fillna(x[x.isin(product_map.keys())]\
                                     .ffill(axis=1).bfill(axis=1).iloc[:, 0])
df['company'].replace(product_map, inplace=True)
print(df)
#               comp  price     company
#0     dell notebook      0   Dell Inc.
#1  dell notebook S3      1   Dell Inc.
#2      dell notepad      2   Dell Inc.
#3        apple ipad      3  Apple Inc.
#4       apple ipad2      4  Apple Inc.
#5   acer chromebook      5   Acer Inc.
#6  acer chromebookx      6   Acer Inc.
#7           mac air      7  Apple Inc.
#8           mac pro      8  Apple Inc.
#9         lenovo x4      9   Dell Inc.
pault
  • 41,343
  • 15
  • 107
  • 149