0

I'm trying to use numpy.select to replace string values within a column; if string contains a keyword, I need the whole string to be replaced with another keyword (there are +- 25 combinations).

df["new_col"] = np.select(
    condlist=[
        df["col"].str.contains("cat1", na=False, case=False),
        df["col"].str.contains("cat2", na=False, case=False),
        df["col"].str.contains("cat3", na=False, case=False),
        df["col"].str.contains("cat4", na=False, case=False),
        # ...
        df["col"].str.contains("cat25", na=False, case=False),
    ],
    choicelist=[
        "NEW_cat1",
        "NEW_cat2",
        "NEW_cat3",
        "NEW_cat4",
        # ...
        "NEW_cat25"
    ],
    default="DEFAULT_cat",
)

Is there a more concise way, or should I just repeat str.contains(...) within condlist 25 times?; is numpy.select the proper way to do it, at all?

I assume dict could be used here, but don't see how exactly.

df["col"].map(d) where d is a dict with old and new values like {"cat1":"NEW_cat1"} wouldn't work (?) since I can't hardcode exact values that need to be replaced (and that's why I'm using str.contains).

help-ukraine-now
  • 3,850
  • 4
  • 19
  • 36

3 Answers3

2

Should be able to use str.extract and then map the matches.

Setup

import pandas as pd
import re

df = pd.DataFrame({'col': ['foo', 'foOBar', 'oRange', 'manGo', 'i LIKE PIZZA', 
                           'some sentence with foo', 'foo and PizzA']})

cat_list = ['foo', 'orange', 'pizza']  # all lower case
label_l = ['Label_foo', 'Label_orange', 'Label_pizza']

Code

patt = re.compile('('+'|'.join(cat_list)+')', re.IGNORECASE)

df['new_col'] = (df.col.str.extract(patt)[0]  # First label in str if multiple
                   .str.lower()
                   .map(dict(zip(cat_list, label_l)))
                   .fillna('DEFAULT_LABEL'))

                      col        new_col
0                     foo      Label_foo
1                  foOBar      Label_foo
2                  oRange   Label_orange
3                   manGo  DEFAULT_LABEL
4            i LIKE PIZZA    Label_pizza
5  some sentence with foo      Label_foo
6           foo and PizzA      Label_foo

If there's a chance for multiple matches and we need to implement a hierarchy where 'pizza' should be prioritized over 'foo' we can add a few more steps using an ordered category dtype.

cat_list = ['pizza', 'orange', 'foo']  # ordered in priority
label_l = ['Label_pizza', 'Label_orange', 'Label_foo']

my_cat = pd.api.types.CategoricalDtype(categories=cat_list, ordered=True)

s = (df.col.str.extractall(patt)[0]
       .str.lower()
       .astype(my_cat))

df['new_col'] = (s.to_frame().groupby(level=0).min()[0]  # min gets priority
                  .map(dict(zip(cat_list, label_l))))
df['new_col'] = df['new_col'].astype(str).replace('nan', 'DEFAULT_LABEL')
#                      col        new_col
#0                     foo      Label_foo
#1                  foOBar      Label_foo
#2                  oRange   Label_orange
#3                   manGo  DEFAULT_LABEL
#4            i LIKE PIZZA    Label_pizza
#5  some sentence with foo      Label_foo
#6           foo and PizzA    Label_pizza
ALollz
  • 57,915
  • 7
  • 66
  • 89
2

The contents you are passing as the condlist and choicelist parameters are ordinary Python lists. List contents can be produced in a concise way in the language by using list comprehensions, that is the syntax: [expression_using_item for item in sequence]

In other words, your code can be written as:

df["new_col"] = np.select(
    condlist=[
       df["col"].str.contains(f"cat{i}", na=False, case=False) for i in range(1, 26)],        
    choicelist=[f"NEW_cat{i}" for i in range(1, 26)],
    default="DEFAULT_cat",
)

(and if the category names is not a numeric sequence, and you are giving these names here just as an example, you create a sequence (list) with all the explicit category names and plug that in place of the range() call in the snippet above)

jsbueno
  • 99,910
  • 10
  • 151
  • 209
2

Based off of this answer to a similar question, and this one, a simple solution:

import pandas as pd
import string

# Preparing test data
test_cont = [f"cat_{i}" for i in string.ascii_lowercase]
test_rep = [f"cat_{i}" for i in range(27)]

kv = zip(test_cont, test_rep)

test_df_data = zip(range(27), test_cont)

test_df = pd.DataFrame(data=test_df_data, columns=["some_col", "str_vals"])


# The solution itself
for (cont, rep) in kv:
    cont_mask = test_df["str_vals"].str.contains(cont, na=False, case=False)
    test_df.loc[cont_mask, "str_vals"] = rep
AMC
  • 2,642
  • 7
  • 13
  • 35