Replace exact substrings in a column based on a table / list / dataframe in python

Question

I have a pandas dataframe df, where one column has a string in it:

columnA
'PSX - Judge A::PSK-Ama'
'VSC - Jep::VSC-Da'
'VSO - Jep::VSO-Da'
...

And I have another dataframe, where I have the new strings:

old new
PSX PCC
VSO VVV

My desired outcome would be:

columnA
'PCC - Judge A::PCC-Ama'
'VSC - Jep::VSC-Da'
'VVV - Jep::VVV-Da'
...

My idea would be to write:

import re
df['columnA'] = df.replace('PSX', 'PCC', regex=True)
df['columnA'] = df.replace('VSO', 'VVV', regex=True)

for two replacements it is ok, but how to do it for severel replacements? Is there a smarter way to do it?

The dataframe you get here (thx to Daniel):

df = pd.DataFrame(data=['PSX - Judge A::PSK-Ama',
                        'VSC - Jep::VSC-Da',
                        'VSO - Jep::VSO-Da'], columns=['columnA'])
replace = pd.DataFrame(data=[['PSX', 'PCC'],
                             ['VSO', 'VVV']], columns=['old', 'new'])

Why did not you involve "another dataframe" ? What's the sense of mentioning it? — RomanPerekhrest, Nov 18 '19 at 14:23

score 1 · Answer 1 · answered Nov 18 '19 at 14:26

1

for row in df_map.iterrows():
    df['columnA'] = df.replace(row[0], row[1], regex=True)

Where df_map is your mapping DataFrame.

answered Nov 18 '19 at 14:26

rbcvl

406
3
13

This takes 9.5315 seconds in my case (only measured once) – PV8 Nov 19 '19 at 08:02

Dani Mesejo · Accepted Answer · 2019-11-18T14:38:40.323

You could use the fact that the replacement parameter can be a function:

import pandas as pd

df = pd.DataFrame(data=['PSX - Judge A::PSK-Ama',
                        'VSC - Jep::VSC-Da',
                        'VSO - Jep::VSO-Da'], columns=['columnA'])

replace = pd.DataFrame(data=[['PSX', 'PCC'],
                             ['VSO', 'VVV']], columns=['old', 'new'])

lookup = dict(zip(replace.old, replace.new))


def repl(w, lookup=lookup):
    return lookup.get(w.group(), w.group())


df['columnA'] = df['columnA'].str.replace('\w+', repl)

print(df)

Output

                  columnA
0  PCC - Judge A::PSK-Ama
1       VSC - Jep::VSC-Da
2       VVV - Jep::VVV-Da

The idea is to extract the words in columnA and if it matches one in lookup replace it. This is inspired by this answer, in which bench-marking shows this to be the more competitive approach.

This takes 1.26359 seconds in my case (only measured once) – PV8 Nov 19 '19 at 08:01 — PV8, Nov 19 '19 at 08:01

score 1 · Answer 3 · answered Nov 18 '19 at 14:33

1

You can make a "replacement dictionary" out of your second dataframe and then iterate over the keys and values and meanwhile use str.replace. This solution should be quite fast:

replacements = dict(zip(df2['old'], df2['new']))

for k, v in replacements.items():
    df['columnA'] = df['columnA'].str.replace(k, v)

                  columnA
0  PCC - Judge A::PSK-Ama
1       VSC - Jep::VSC-Da
2       VVV - Jep::VVV-Da

answered Nov 18 '19 at 14:33

Erfan

40,971
8
66
78

This takes 3.5880 seconds in my case (only measured once) – PV8 Nov 19 '19 at 08:05

Replace exact substrings in a column based on a table / list / dataframe in python

3 Answers3