
The following regular expression substitution:

import re
s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
p = re.compile(r'blue (?P<animal>dog|cat)')
p.sub(r'\1',s)

results in,

'the dog and cat wore 7 blue hats 9 days ago'
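
For reference, the replacement string can also refer to the group by name instead of by number. A minimal sketch, reusing the pattern above (the variable names by_number and by_name are only for illustration):

```python
import re

s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
p = re.compile(r'blue (?P<animal>dog|cat)')

# \1 refers to the first group by position;
# \g<animal> refers to the same group by its name.
by_number = p.sub(r'\1', s)
by_name = p.sub(r'\g<animal>', s)

print(by_name)
```

Both forms insert the matched text of the group, not the group's name, which is why the question below needs something more.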

Is it possible to write a re.sub such that:

import re
s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
p = re.compile(r'blue (?P<animal>dog|cat)|(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9])')

results in,

'the animal and animal wore numberBelowSeven blue hats numberNotSeven days ago'

Curiously, there is plenty of documentation on replacement strings and on retrieving group names, but no well-documented way to do both at once.

zelusp

2 Answers


You could use re.sub with a callback which returns matchobj.lastgroup:

import re

s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
p = re.compile(r'blue (?P<animal>dog|cat)|(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9])')

def callback(matchobj):
    return matchobj.lastgroup

result = p.sub(callback, s)
print(result)

yields

the animal and animal wore numberBelowSeven blue hats numberNotSeven days ago
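
If there are many groups, the combined pattern could be assembled from a mapping rather than typed out by hand. This is only a sketch under some assumptions: the GROUPS dict and the replace_with_group_name helper are hypothetical names, each key must be a valid Python identifier, and each sub-pattern is wrapped whole in its named group (so the inner dog|cat alternation becomes non-capturing).

```python
import re

# Hypothetical mapping of group name -> sub-pattern (not from the original post).
GROUPS = {
    "animal": r"blue (?:dog|cat)",
    "numberBelowSeven": r"[0-7]",
    "numberNotSeven": r"[8-9]",
}

# Wrap every sub-pattern in a named group and join them with alternation.
combined = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in GROUPS.items())
)

def replace_with_group_name(text):
    # Exactly one top-level alternative matches, so lastgroup is its name.
    return combined.sub(lambda matchobj: matchobj.lastgroup, text)

s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
result = replace_with_group_name(s)
print(result)
```

The pattern is compiled once and the string is scanned in a single pass, which may matter when there are many groups and many rows.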

Note that if you are using pandas, you could use Series.str.replace with the same kind of callback:

import pandas as pd

def callback(matchobj):
    return matchobj.lastgroup

df = pd.DataFrame({'foo':["the blue dog", "and blue cat wore 7 blue", "hats 9", 
                          "days ago"]})
pat = r'blue (?P<animal>dog|cat)|(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9])'
# regex=True is required for a callable replacement in recent pandas versions
df['result'] = df['foo'].str.replace(pat, callback, regex=True)
print(df)

yields

                        foo                                 result
0              the blue dog                             the animal
1  and blue cat wore 7 blue  and animal wore numberBelowSeven blue
2                    hats 9                    hats numberNotSeven
3                  days ago                               days ago

If you have nested named groups, you may need a more complicated callback which iterates through matchobj.groupdict().items() to collect all the relevant group names:

import pandas as pd

def callback(matchobj):
    names = [groupname for groupname, matchstr in matchobj.groupdict().items()
             if matchstr is not None]
    names = sorted(names, key=lambda name: matchobj.span(name))
    result = ' '.join(names)
    return result

df = pd.DataFrame({'foo':["the blue dog", "and blue cat wore 7 blue", "hats 9", 
                          "days ago"]})

pat=r'blue (?P<animal>dog|cat)|(?P<numberItem>(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9]))'

# pat=r'(?P<someItem>blue (?P<animal>dog|cat)|(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9]))'

# regex=True is required for a callable replacement in recent pandas versions
df['result'] = df['foo'].str.replace(pat, callback, regex=True)
print(df)

yields

                        foo                                            result
0              the blue dog                                        the animal
1  and blue cat wore 7 blue  and animal wore numberItem numberBelowSeven blue
2                    hats 9                    hats numberItem numberNotSeven
3                  days ago                                          days ago
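
As a side note on why the simpler callback is not enough for nested groups: as far as I can tell, matchobj.lastgroup reports the named group that closed last, which for nested groups is the outer one. A minimal sketch with plain re (the helper name matched_names is made up for illustration):

```python
import re

pat = re.compile(
    r'blue (?P<animal>dog|cat)'
    r'|(?P<numberItem>(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9]))'
)

def matched_names(matchobj):
    # Every named group that took part in the match,
    # in the order the groups are defined in the pattern.
    return [name for name, text in matchobj.groupdict().items()
            if text is not None]

m = pat.search("hats 9")
outer_only = m.lastgroup      # only the enclosing group's name
all_names = matched_names(m)  # enclosing group plus the inner one
print(outer_only, all_names)
```

So with nesting, lastgroup alone would lose the inner name, while groupdict() keeps both.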
unutbu
  • This answers my question nicely - thank you! I've run into an unanticipated problem - the groups whose names I'm using to replace bits of text are actually within a group themselves. How would one modify this to work in the same way if `pat=r'(?P<someItem>blue (?P<animal>dog|cat)|(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9]))'`? – zelusp Apr 29 '16 at 18:40
  • To play nicely with the examples we already have, it's probably better to ask *can your solution be extended to return group names within the* `numberItem` *group, assuming* `pat=r'blue (?P<animal>dog|cat)|(?P<numberItem>(?P<numberBelowSeven>[0-7])|(?P<numberNotSeven>[8-9]))'` – zelusp Apr 29 '16 at 18:55
  • I've added an alternative callback which can handle nested named groups. – unutbu Apr 29 '16 at 19:20

Why not call re.sub() multiple times?

>>> import re
>>> s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
>>> s = re.sub(r"blue (dog|cat)", "animal", s)
>>> s = re.sub(r"\b[0-7]\b", "numberBelowSeven", s)
>>> s = re.sub(r"\b[8-9]\b", "numberNotSeven", s)
>>> s
'the animal and animal wore numberBelowSeven blue hats numberNotSeven days ago'

You can then collect the patterns and replacements into a "list of changes" and apply them one by one:

>>> changes = [
...     (re.compile(r"blue (dog|cat)"), "animal"),
...     (re.compile(r"\b[0-7]\b"), "numberBelowSeven"),
...     (re.compile(r"\b[8-9]\b"), "numberNotSeven")
... ]
>>> s = "the blue dog and blue cat wore 7 blue hats 9 days ago"
>>> for pattern, replacement in changes:
...     s = pattern.sub(replacement, s)
... 
>>> s
'the animal and animal wore numberBelowSeven blue hats numberNotSeven days ago'

Note that I've also added word boundary checks (\b), so that digits inside longer numbers are left alone.
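
To illustrate what the word boundary changes, here is a small sketch on a made-up input string (the text and variable names are assumptions for the example, not from the question):

```python
import re

text = "17 blue hats and 7 days"

# Without \b, each digit of "17" is matched and replaced separately.
without_boundary = re.sub(r"[0-7]", "numberBelowSeven", text)

# With \b, only a digit standing alone as a word is replaced.
with_boundary = re.sub(r"\b[0-7]\b", "numberBelowSeven", text)

print(without_boundary)
print(with_boundary)
```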

alecxe
  • Because I need to do this on 20+ groups that collectively search for 40+ terms on a pandas dataframe 200,000 rows tall – zelusp Apr 29 '16 at 18:03
  • He wanted the "blue animal" to turn into "animal" – lazary Apr 29 '16 at 18:03
  • @zelusp okay, but you can still do it by making a list of patterns and replacements and applying them iteratively; I've added a sample to the answer. – alecxe Apr 29 '16 at 18:09