I am trying to use pandas str.replace function to replace a pattern.

But when I do:

pd.DataFrame({'text_col':['aaa', 'c', 'bbbbb', 'ddd']})['text_col'].str.replace('.*', 'RR')

it for some reason returns:

0    RRRR
1    RRRR
2    RRRR
3    RRRR
Name: text_col, dtype: object

I would have thought it should return the same as:

pd.DataFrame({'text_col':['aaa', 'c', 'bbbbb', 'ddd']})['text_col'].str.replace('^.*$', 'RR')

which returns:

0    RR
1    RR
2    RR
3    RR
Name: text_col, dtype: object

If I compare this behavior to the R programming language, replacing the patterns .* and ^.*$ yields the same result there. Why is it different in pandas?

ira

1 Answer

The two regex patterns are not equivalent.

  • a* -> Zero or more of a.

Take a look at this example.

>>> import re
>>> re.findall('.*', 'c')
# ['c', '']

>>> re.findall('.*', 'AAAAAAA')
# ['AAAAAAA', '']

>>> re.findall('.*', '')
# [''] 
  • '.*' matches empty strings too. Series.str.replace replaces every match, and for a non-empty string there are always two: first the whole string, then an empty string at its end. Both are replaced with 'RR', so you always get 'RRRR'.
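
The doubling can be reproduced with plain re.sub, used here as a stand-in for what str.replace does to each element in regex mode (that equivalence is an assumption, not something stated in this answer). A minimal sketch, assuming Python 3.7 or later, where an empty match adjacent to a previous non-empty match is also replaced:

```python
import re

# '.*' first consumes the whole string, then matches once more
# as an empty string at the very end; both matches are replaced.
doubled = re.sub(r'.*', 'RR', 'aaa')

# With ^ and $ the pattern can only match starting at position 0,
# so the leftover empty match at the end is not possible.
anchored = re.sub(r'^.*$', 'RR', 'aaa')

print(doubled)   # RRRR
print(anchored)  # RR
```

This is why the anchored pattern from the question returns a single 'RR' per row.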

If you want to match one or more characters, you can use the regex below.

pat = r'.{1,}'  # no space inside the braces; equivalent to r'.+'
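
As a rough check of the "one or more" pattern, the sketch below applies it to the four values from the question using re.sub rather than pandas (an assumption made to keep the example dependency-free; the same pattern can be passed to str.replace with regex=True). Note that the quantifier has to be written without a space: Python's re does not parse '{1, }' as a quantifier and treats it as literal text.

```python
import re

texts = ['aaa', 'c', 'bbbbb', 'ddd']  # values from the question

# '.+' and '.{1,}' both require at least one character,
# so no empty match is left over at the end of the string.
plus = [re.sub(r'.+', 'RR', t) for t in texts]
braces = [re.sub(r'.{1,}', 'RR', t) for t in texts]

# With a space inside the braces, '{1, }' is literal text,
# so nothing matches and the strings come back unchanged.
spaced = [re.sub(r'.{1, }', 'RR', t) for t in texts]

print(plus)    # ['RR', 'RR', 'RR', 'RR']
print(braces)  # ['RR', 'RR', 'RR', 'RR']
print(spaced)  # ['aaa', 'c', 'bbbbb', 'ddd']
```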
Ch3steR
  • Oh wow, it didn't occur to me that it would also match an empty string in a non-empty string. I thought that .* would just match the entire string. – ira Feb 25 '21 at 12:25