0
import pandas as pd
dataframe = pd.DataFrame({'Data' : ['The **ALI**1929 for 90 days but not 77731929 ', 
                                       'For all **ALI**1952  28A 177945 ', 
                                       'But the **ALI**1914 and **ALI**1903 1912',],
                          'ID': [1,2,3]

                         })

Data    ID
0   The **ALI**1929 for 90 days but not 77731929    1
1   For all **ALI**1952 28A 177945                  2
2   But the **ALI**1914 and **ALI**1903 1912        3

My dataframe looks like the one above. My goal is to replace any number at or under 1929 that is associated with **ALI** with the word OLDER. So **ALI**1929 would become **ALI**OLDER and **ALI**1903 would also become **ALI**OLDER, but **ALI**1952 would remain the same. Based on "How to extract certain length of numbers from a string in python?", I have tried

dataframe['older'] = dataframe['Data'].str.replace(r'(?<!\d)(\d{3})(?!\d)', 'OLDER')

But this doesn't work well for what I want. I would like something like this as output:

 Data        ID     older
0                 The ALI**OLDER for 90 days but not 77731929
1                 For all ALI**1952 28A 177945
2                 But the ALI**OLDER and ALI**OLDER 1912

How do I change the regex in my str.replace call, r'(?<!\d)(\d{3})(?!\d)', to do so?

5 Answers

1

You can use this

(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)
  • (?<=\*) - Must be preceded by *
  • (?:0\d{3}|1[0-8]\d{2}|19[0-2]\d) - Non-capturing group with three alternatives
    • 0\d{3} - Matches any 4-digit number from 0000 to 0999
    • | - Alternation
    • 1[0-8]\d{2} - Matches any 4-digit number from 1000 to 1899
    • | - Alternation
    • 19[0-2]\d - Matches any 4-digit number from 1900 to 1929
  • (?!\d) - Must not be followed by a digit
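
To try the pattern outside pandas, here is a minimal sketch using Python's re module on the three sample strings from the question. The names PATTERN, samples, replaced and matched are chosen for illustration and are not part of the answer. The last step brute-forces every four-digit number to see which ones the pattern accepts after a `*`. The same pattern string should also work with dataframe['Data'].str.replace(..., 'OLDER', regex=True).

```python
import re

# Pattern from the answer above: a 4-digit number <= 1929 that directly follows '*'.
PATTERN = re.compile(r'(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)')

# Sample strings copied from the question's dataframe.
samples = [
    'The **ALI**1929 for 90 days but not 77731929 ',
    'For all **ALI**1952  28A 177945 ',
    'But the **ALI**1914 and **ALI**1903 1912',
]

# Replace every match with the literal word OLDER.
replaced = [PATTERN.sub('OLDER', s) for s in samples]
for line in replaced:
    print(line)

# Brute-force check of the numeric range: which 4-digit numbers match after a '*'?
matched = [n for n in range(10000) if PATTERN.search(f'*{n:04d}')]
print(min(matched), max(matched), len(matched))  # 0 1929 1930
```

Note that the lookbehind only requires a `*`, so a four-digit number after any asterisk, not only after **ALI**, would also be replaced.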


Code Maniac
0

Use str.extractall and np.where with str.replace:

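import numpy as np

# Collect every number that follows **ALI**, one column per match
nums = dataframe['Data'].str.extractall(r'(?<=\*\*ALI\*\*)(\d+)').astype(int).unstack()

# Replace the ALI numbers in rows where any of them is <= 1929
dataframe['older'] = np.where(nums.le(1929).any(axis=1), 
                              dataframe['Data'].str.replace(r'(?<=\*\*ALI\*\*)(\d+)', 'OLDER', regex=True), 
                              dataframe['Data'])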

Output

                                            Data  ID                                           older
0  The **ALI**1929 for 90 days but not 77731929    1  The **ALI**OLDER for 90 days but not 77731929 
1               For all **ALI**1952  28A 177945    2                For all **ALI**1952  28A 177945 
2       But the **ALI**1914 and **ALI**1903 1912   3      But the **ALI**OLDER and **ALI**OLDER 1912
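
One thing to be aware of: np.where picks either the fully replaced string or the original string per row, so a row that mixes an older and a newer **ALI** number gets both replaced. The sketch below mirrors that row-level logic in plain Python and compares it with a per-match decision. The mixed row is a hypothetical input that does not appear in the question's data, and the helper names are made up for illustration.

```python
import re

# Same lookbehind pattern as in the answer: digits right after **ALI**.
ALI_NUM = re.compile(r'(?<=\*\*ALI\*\*)(\d+)')

def replace_per_row(text, limit=1929):
    """Mirror the np.where logic: replace all ALI numbers if any is <= limit."""
    nums = [int(n) for n in ALI_NUM.findall(text)]
    if any(n <= limit for n in nums):
        return ALI_NUM.sub('OLDER', text)
    return text

def replace_per_match(text, limit=1929):
    """Decide for each ALI number on its own."""
    return ALI_NUM.sub(
        lambda m: 'OLDER' if int(m.group(1)) <= limit else m.group(1), text)

mixed = 'Both **ALI**1914 and **ALI**1952'  # hypothetical row, not from the question
print(replace_per_row(mixed))    # Both **ALI**OLDER and **ALI**OLDER
print(replace_per_match(mixed))  # Both **ALI**OLDER and **ALI**1952
```

For the three rows in the question both variants give the same result, since no row mixes numbers on both sides of 1929.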
Erfan
0

As I see it, the regex should match **ALI**nnnn (nnnn - 4 digits) and:

  • The initial ** - should be deleted (always).
  • ALI** - should be left unchanged.
  • nnnn - should be optionally replaced with OLDER.

In this case, a complex regex is not necessary. The whole logic can be contained in a "replacement" function.

Define it as follows:

def repl(mtch):
    g1, g2 = mtch.group(1), mtch.group(2)
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

Then use str.replace with this function:

dataframe.Data = dataframe.Data.str.replace(r'\*\*(ALI\*\*)(\d{4})(?!\d)', repl, regex=True)

Note that I also changed the regex, defining 2 capturing groups.
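
For a quick check outside pandas, the same replacement function can be passed to re.sub. This is only a sketch on the question's sample strings; PATTERN, samples and result are names chosen for illustration.

```python
import re

# Two capturing groups: 'ALI**' (kept) and the 4-digit number (kept or replaced).
# The leading '**' is matched outside the groups, so it is dropped from the result.
PATTERN = re.compile(r'\*\*(ALI\*\*)(\d{4})(?!\d)')

def repl(mtch):
    g1, g2 = mtch.group(1), mtch.group(2)
    # Keep numbers above 1929, replace the rest with OLDER.
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

samples = [
    'The **ALI**1929 for 90 days but not 77731929 ',
    'For all **ALI**1952  28A 177945 ',
    'But the **ALI**1914 and **ALI**1903 1912',
]

result = [PATTERN.sub(repl, s) for s in samples]
for line in result:
    print(line)
```

Because the decision is made per match inside repl, changing the cutoff only means changing one number rather than rewriting the regex.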

Valdi_Bo
0
dataframe.Data.str.replace(r"(?<=\*ALI[*]{2})1(?:[0-8][0-9]{2}|9[0-2][0-9])\b", "OLDER", regex=True)
Out[364]: 
0    The **ALI**OLDER for 90 days but not 77731929 
1                  For all **ALI**1952  28A 177945 
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object
  • (?<=\*ALI[*]{2}) preceded by *ALI**
  • 1 the leading digit, so only 1000-1999 is considered
  • (?: start of the non-capturing group
    • [0-8][0-9]{2} i.e. 000-899, giving 1000-1899
    • |9[0-2][0-9] i.e. 900-929, giving 1900-1929
  • ) end of the non-capturing group
  • \b word boundary
Onyambu
0

Define a custom repl callable and use it with str.replace:

repl = lambda m: m.group(1) if int(m.group(1)) > 1929 else 'OLDER'
dataframe.Data.str.replace(r'(?<=\*\*ALI\*\*)(\d+)', repl, regex=True)

Out[662]:
0    The **ALI**OLDER for 90 days but not 77731929
1                  For all **ALI**1952  28A 177945
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object
Andy L.