I want to use regex to replace a substring of a matched string in a df series. I have looked through the documentation (e.g. here) and I have found a pattern that matches the specific type of string I am after. However, the replacement does not substitute only the substring.

I have cases such as

data
initthe problem
nationthe airline
radicthe groups
professionthe experience
the cat in the hat

In this particular case, I am interested in substituting "the" with "al" only in those cases where "the" is not a standalone word (a standalone word being one that is preceded and followed by whitespace).

I have tried the following solution:

import re
patt = re.compile(r'(?:[a-z])(the)')
df['data'].str.replace(patt, r'al')

However, it also replaces the non-whitespace character preceding the "the".
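
A minimal reproduction of that behaviour, using plain `re` on one of the sample strings instead of the DataFrame (the variable name `result` is only for illustration). The non-capturing group `(?:[a-z])` is still part of the match, so the letter before "the" gets replaced as well:

```python
import re

# Same pattern as above: (?:...) does not capture, but it still consumes the letter
patt = re.compile(r'(?:[a-z])(the)')

# The match is "tthe" (the "t" plus "the"), and all of it is replaced by "al"
result = patt.sub('al', 'initthe problem')
print(result)  # inial problem
```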

Any suggestions on what I can do to replace only the substring in those specific cases?

owwoow14
  • But `initthe` will turn into `inital`; I guess you need `initial`? That holds even if you fix it to `df['data'].str.replace(r'(?<=[a-z])the', r'al')` – Wiktor Stribiżew Oct 08 '18 at 10:34

1 Answer

Try using a lookbehind, which asserts that a lowercase letter comes before "the" but does not actually consume anything:

import re

# Sample column contents as one string; "text" avoids shadowing the built-in input()
text = "data\ninitthe problem\nnationthe airline\nradicthe groups\nprofessionthe experience\nthe cat in the hat"

output = re.sub(r'(?<=[a-z])the', 'al', text)
print(output)

data
inital problem
national airline
radical groups
professional experience
the cat in the hat
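
The same lookbehind should carry over to the pandas column from the question. This is a minimal sketch, assuming a DataFrame `df` with a `data` column built from the sample rows (the frame is constructed here only for illustration). `regex=True` is passed explicitly because recent pandas versions treat a string pattern as a literal by default:

```python
import pandas as pd

# Illustrative frame built from the sample rows in the question
df = pd.DataFrame({"data": [
    "initthe problem",
    "nationthe airline",
    "radicthe groups",
    "professionthe experience",
    "the cat in the hat",
]})

# The lookbehind requires a lowercase letter before "the" without consuming it
df["data"] = df["data"].str.replace(r"(?<=[a-z])the", "al", regex=True)
print(df["data"].tolist())
```

Note that `str.replace` returns a new Series, so the result has to be assigned back to the column for the change to persist.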

Tim Biegeleisen