Why does pandas str.replace with the .* pattern insert the replacement value multiple times?

Question

I am trying to use the pandas str.replace function to replace a pattern.

But when I do:

pd.DataFrame({'text_col':['aaa', 'c', 'bbbbb', 'ddd']})['text_col'].str.replace('.*', 'RR')

it for some reason returns:

0    RRRR
1    RRRR
2    RRRR
3    RRRR
Name: text_col, dtype: object

While I would have thought it should return the same as:

pd.DataFrame({'text_col':['aaa', 'c', 'bbbbb', 'ddd']})['text_col'].str.replace('^.*$', 'RR')

which returns:

0    RR
1    RR
2    RR
3    RR
Name: text_col, dtype: object

If I compare this behavior to the R programming language, replacing the patterns .* and ^.*$ yields the same result. Why is it different in pandas?

score 0 · Accepted Answer · answered Feb 25 '21 at 12:03

The two regex patterns are not equivalent.

Take a look at this example:

>>> import re
>>> re.findall('.*', 'c')
# ['c', '']

>>> re.findall('.*', 'AAAAAAA')
# ['AAAAAAA', '']

>>> re.findall('.*', '')
# ['']

'.*' also matches the empty string. str.replace replaces every match, and for a non-empty string there are always two: first the whole string, then the empty string at its end. Each match is replaced with 'RR', so you always get 'RRRR'.
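
To see where the two replacements come from, here is a small sketch using Python's re module directly. It assumes Python 3.7 or later, where an empty match directly after a non-empty one is also replaced, and that str.replace passes the pattern on to re.sub for ordinary object-dtype strings:

```python
import re

text = 'aaa'

# Each match of '.*' as a (start, end) span: the whole string,
# then an empty match at the very end.
spans = [m.span() for m in re.finditer('.*', text)]
print(spans)      # [(0, 3), (3, 3)]

# re.sub replaces both matches, hence the doubled replacement.
doubled = re.sub('.*', 'RR', text)
print(doubled)    # RRRR

# With anchors, only one match is possible.
anchored = re.sub('^.*$', 'RR', text)
print(anchored)   # RR
```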

If you want to match one or more characters, you can use the regex below.

pat = r'.{1,}'  # no space inside the braces; equivalent to r'.+'
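
As a quick check, here is a sketch of the same column with a pattern that requires at least one character. regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series(['aaa', 'c', 'bbbbb', 'ddd'], name='text_col')

# '.+' (same as '.{1,}') cannot match the empty string,
# so each value is replaced exactly once.
result = s.str.replace(r'.+', 'RR', regex=True)
print(result.tolist())   # ['RR', 'RR', 'RR', 'RR']
```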

Oh wow, it didn't occur to me that it would also match an empty string within a non-empty string. I thought that .* would just match the entire string. - ira, Feb 25 '21 at 12:25
