4

This is a follow-up to this SO post, which gives a solution for replacing text in a string column:

How to replace text in a column of a Pandas dataframe?

df['range'] = df['range'].str.replace(',','-')

However, this doesn't seem to work with double periods, or with a question mark followed by a period:

import pandas as pd

testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()

results in

0     ...........e
1    .............
Name: strings, dtype: object

and

testDf['strings'].str.replace('?.', '?').head()

results in

error: nothing to repeat at position 0
SantoshGupta7
  • `testDf['strings'] = testDf['strings'].str.replace('\..', '.')` A backslash is required because `.` is a regex character. – David Erickson Jul 28 '20 at 20:35
  • `?` and `.` are special characters in `regex`. You need to precede them with the escape character `\`, ideally inside a raw string. – Quang Hoang Jul 28 '20 at 20:36

5 Answers

4

Add the regex=False parameter because, as you can see in the docs, regex is True by default (this was the case before pandas 2.0, which changed the default to False):

regex : bool, default True

Determines if the passed-in pattern is a regular expression: if True, assumes the passed-in pattern is a regular expression.

Both ? and . are special characters in regular expressions.
So one way to do it without regex is this double replacement:

testDf['strings'].str.replace('..', '.', regex=False).str.replace('?.', '?', regex=False)

Output:

                     strings
0     this is a. test stence
1  for which is ? was a time
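
As a minimal sketch of the literal-replacement approach above, the snippet below rebuilds the question's data and checks the result. The names `test_df` and `cleaned` are illustrative, not from the original post.

```python
import pandas as pd

test_df = pd.DataFrame(
    ["this is a.. test stence", "for which is ?. was a time"],
    columns=["strings"],
)

# regex=False makes pandas treat each pattern as a literal substring,
# so neither '.' nor '?' has any special meaning here.
cleaned = (
    test_df["strings"]
    .str.replace("..", ".", regex=False)
    .str.replace("?.", "?", regex=False)
)

print(cleaned.tolist())
```

Passing regex=False explicitly keeps the behavior the same regardless of which default the installed pandas version uses.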
MrNobody33
2

Replace using a regular expression. In this case, remove any '.' that is immediately followed by whitespace. This is a bit tricky, so I advise you to go with @Mark Reed's answer.

testDf.replace(regex=r'([.](?=\s))', value=r'')


                  strings
0     this is a. test stence
1  for which is ? was a time
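
One caveat with the lookahead pattern, sketched here with the standard `re` module rather than pandas: it removes every period that is followed by whitespace, including ordinary sentence-ending ones. The third sample string is made up for illustration.

```python
import re

# A period, but only when the next character is whitespace.
# The lookahead (?=\s) checks the whitespace without consuming it.
lookahead = re.compile(r"[.](?=\s)")

samples = [
    "this is a.. test stence",
    "for which is ?. was a time",
    "First sentence. Second sentence.",  # hypothetical extra input
]
results = [lookahead.sub("", s) for s in samples]

for before, after in zip(samples, results):
    print(repr(before), "->", repr(after))
```

The first two strings come out as wanted, but the third loses the period after "First sentence", which may or may not be acceptable for your data.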
wwnde
1

str.replace() works with a regex, where . is a special character that matches any character. If you want a literal dot, you need to escape it: "\.". The same goes for other special regex characters such as ?.
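
To make the difference concrete, here is a small sketch using Python's `re` module, which pandas relies on when the pattern is treated as a regex. It reproduces both symptoms from the question and shows `re.escape` as one way to build an escaped pattern. The variable names are illustrative.

```python
import re

text = "this is a.. test stence"

# Unescaped, '..' means "any two characters", so every pair is replaced.
unescaped = re.sub("..", ".", text)

# Escaped, r"\.\." matches only two literal periods.
escaped = re.sub(r"\.\.", ".", text)

# A leading '?' has nothing to repeat, so the pattern fails to compile.
try:
    re.compile("?.")
    error_message = ""
except re.error as exc:
    error_message = str(exc)

# re.escape builds the escaped pattern from a literal string.
pattern = re.escape("?.")
question = re.sub(pattern, "?", "for which is ?. was a time")

print(unescaped)
print(escaped)
print(error_message)
print(question)
```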

Thomas Weller
1

To replace both ?. and .. in one call, you can separate the two patterns with | (the regex OR operator):

testDf['strings'].str.replace(r'\?\.|\.\.', '.', regex=True)

Escape each . with a \, because . is a regex character that otherwise matches any character:

testDf['strings'].str.replace(r'\.\.', '.', regex=True)

You can do the same with the ?, which is another regex character:

testDf['strings'].str.replace(r'\?\.', '.', regex=True)
David Erickson
1

First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.

A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.

The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:

testDf.replace(regex=r'([.?])\.', value=r'\1')

The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].

In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.

Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
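
A short sketch of that last point, applied to a single column with Series.str.replace and an explicit regex=True (an assumption about how one might use the same pattern on one column; the names `test_df`, `fixed`, and `broken` are illustrative):

```python
import pandas as pd

test_df = pd.DataFrame(
    ["this is a.. test stence", "for which is ?. was a time"],
    columns=["strings"],
)

pattern = r"([.?])\."  # a '.' or '?', captured, followed by a literal '.'

# Raw replacement: \1 reaches the regex engine and means "group 1".
fixed = test_df["strings"].str.replace(pattern, r"\1", regex=True)

# Non-raw replacement: Python turns "\1" into chr(1) before pandas sees it.
broken = test_df["strings"].str.replace(pattern, "\1", regex=True)

print(fixed.tolist())
print([repr(s) for s in broken])
```

The `broken` strings contain an invisible control character where the punctuation used to be, which is easy to miss when printing without repr.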

Mark Reed