3

I have a list of colors like this:

color = ['green', 'blue', 'red']

I have a Dataframe like this:

df:
col1        col2
 A        dark green
 B        sea blue
 C          blue
 D       exclusive red
 E          green
 F       pale red

I want to match col2 with the color list. If any word of col2 matches the element of the color list, replace it with the lists value.

The result data frame will be

 col1          col2
  A            green
  B            blue
  C            blue
  D            red
  E            green
  F            red

What is the most efficient way to do it using pandas?

Milo
  • 3,172
  • 3
  • 19
  • 21
Kallol
  • 2,089
  • 3
  • 18
  • 33

2 Answers2

1

Use Series.str.extract with joined values by | for regex OR, last add fillna for replace non matched values (NaNs) by original column:

print (df)
  col1           col2
0    A     dark green
1    B       sea blue
2    C           blue
3    D  exclusive red
4    E          green
5    F           pale <- not matched value

color=['green','blue','red']

pat = r'({})'.format('|'.join(color))
df['col2'] = df['col2'].str.extract(pat, expand=False).fillna(df['col2'])

print (df)
  col1   col2
0    A  green
1    B   blue
2    C   blue
3    D    red
4    E  green
5    F   pale
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • if the color doesnt't match how to keep keep the original color ? I don't want nan values there. – Kallol Jan 29 '19 at 10:02
  • @KallolSamanta You can check the other answer as well, it answers the question. I suggest timing all the solution and ticking the best one for you. Thanks. – cs95 Jan 29 '19 at 10:05
1

Use str.extract:

df['col2'] = df.col2.str.extract(f"({'|'.join(color)})", expand=False)
df

  col1   col2
0    A  green
1    B   blue
2    C   blue
3    D    red
4    E  green
5    F    red

For better performance, you can use a list comprehension that uses a precompiled regex pattern to perform re.search:

import re

p = re.compile(rf"({'|'.join(color)})")
def try_extract(s):
    try:
        return p.search(s).group(1)
    except (TypeError, AttributeError):
        return s

df['col2'] = [try_extract(s) for s in df['col2']
df

  col1   col2
0    A  green
1    B   blue
2    C   blue
3    D    red
4    E  green
5    F    red

If the color doesn't match how to keep keep the original color? I don't want nan values there.

This is automatically handled by try_except:

df2 = df.append(pd.Series(
    {'col1': 'G', 'col2': 'something else'}), ignore_index=True)
df2['col2'] = [try_extract(s) for s in df2['col2']]
df2

  col1            col2
0    A           green
1    B            blue
2    C            blue
3    D             red
4    E           green
5    F             red
6    G  something else   # other values are preserved.

For more information on why list comprehensions should be considered a competitive alternative, you can check For loops with pandas - When should I care?.

cs95
  • 379,657
  • 97
  • 704
  • 746