2

Is it possible to replace strings from one column with corresponding strings from another columns in a pandas dataframe using only the pandas.Series.str methods? "No" is an acceptable answer so long as it's accompanied with the pandas version and relevant part of the docs.

Here's an example:

import pandas as pd
# version >= 0.19.2
df = pd.DataFrame(
    {
        'names': ['alice', 'bob', 'catherine', 'slagathor'],
        'hobbies': [
            'alice likes to knit',
            'bob likes to bowl',
            'plays with her cats',
            'slagathor burniates peasants for fun'
        ]
    }
)

def clean(df: pd.DataFrame) -> pd.Dataframe: ... # do the substitutions

assert all(
    clean(df).hobbies == pd.Series([
        'likes to knit',
        'likes to bowl',
        'plays with her cats',
        'burniates peasants for fun'
    ])
)

In this case, I'd like to omit the strings from the name column from the hobbies column, using something like

df.hobbies.str.replace('(' + df.names + r'\s*)?', '')  # doesn't work

So far, I've had to

import re
df['replaced'] = pd.Series(
    re.sub(f'^{df.names[i]} ?', '', df.hobbies[i]) for i in df.index
)  

as in the answer to Replace values from one column with another column Pandas DataFrame

yardstick17
  • 4,322
  • 1
  • 26
  • 33
Steven Kalt
  • 1,116
  • 15
  • 25
  • Worth noting, but not part of this question: solutions that call `re.sub` on concatenated strings need to sanitize those strings. My actual dataset had strings like `'bob++'` and `'slag]athor'` in them. `re.complie(rf'^{name}\s*')` will fail if you don't escape `r'[\[\](){}+*\\?]'` – Steven Kalt May 20 '19 at 21:41

2 Answers2

2

str.replace is a Series method, so it can be applied to each element of particular column, but there is not possible to refer to any other column.

So you have to import re and use re.sub instead, within a function applied to each row (so that this function can refer to other columns of the current row).

Your task can be performed in a single instruction:

df['replaced'] = df.apply(lambda row: re.sub(
    '^' + row.names + r'\s*', '', row.hobbies), axis=1)

This solution runs quicker than to create a Series with for loop inside and substitute under a column afterwards, because apply takes care of looping over the DataFrame, so the function applied is responsible only for generation of a value to be put in the current row.

An important factor concerning execution speed is also that you avoid location of the current row by index, each time in the loop.

Not also that your code would fail if index was other than consecutive numbers starting from 0. Try e.g. to create your DataFrame with index=np.arange(1, 5) parameter.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
0

The apply with replace will do the job here

df.apply(lambda x: x['hobbies'].replace(x['names'],''),axis=1)

It takes every row of data frame and replace the 'names' in 'hobbies' with empty string

usct01
  • 838
  • 7
  • 18