replace four digits pandas

Question

import pandas as pd

dataframe = pd.DataFrame({
    'Data': ['The **ALI**1929 for 90 days but not 77731929 ',
             'For all **ALI**1952  28A 177945 ',
             'But the **ALI**1914 and **ALI**1903 1912'],
    'ID': [1, 2, 3],
})

                                            Data  ID
0  The **ALI**1929 for 90 days but not 77731929    1
1               For all **ALI**1952  28A 177945    2
2       But the **ALI**1914 and **ALI**1903 1912   3

My dataframe looks like the one above. My goal is to replace any number at or under 1929 that is attached to **ALI** with the word OLDER. So **ALI**1929 would become **ALI**OLDER and **ALI**1903 would also become **ALI**OLDER, but **ALI**1952 would remain the same. Based on "How to extract certain length of numbers from a string in python?", I have tried:

dataframe['older'] = dataframe['Data'].str.replace(r'(?<!\d)(\d{3})(?!\d)', 'OLDER', regex=True)

But this doesn't work well for what I want. I would like output like this:

 Data        ID     older
0                 The ALI**OLDER for 90 days but not 77731929
1                 For all ALI**1952 28A 177945
2                 But the ALI**OLDER and ALI**OLDER 1912

How do I change my regex r'(?<!\d)(\d{3})(?!\d)' in str.replace to do so?

With your regex it will match `1912` too. Will the number you want to replace always be preceded by `*`? - Code Maniac, Aug 27 '19 at 19:10
Check [`this`](https://regex101.com/r/TLAPrj/1/). Is this what you're looking for? - Code Maniac, Aug 27 '19 at 19:13

Code Maniac · Accepted Answer · 2019-08-27T19:34:45.247

You can use this

(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)

(?<=\*) - must be preceded by *
(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d) - non-capturing group with three alternatives:
- 0\d{3} - matches any 4-digit number from 0000 to 0999
- | - alternation
- 1[0-8]\d{2} - matches any 4-digit number from 1000 to 1899
- | - alternation
- 19[0-2]\d - matches any 4-digit number from 1900 to 1929
(?!\d) - must not be followed by a digit
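
As a quick sanity check, here is a minimal sketch that runs the pattern above over the three strings from the question using Python's `re` module. The names `PATTERN`, `samples`, and `results` are only for illustration.

```python
import re

# Pattern from the answer: a 4-digit number 0000-1929 directly after '*'
PATTERN = re.compile(r'(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)')

# The three strings from the question's dataframe
samples = [
    'The **ALI**1929 for 90 days but not 77731929 ',
    'For all **ALI**1952  28A 177945 ',
    'But the **ALI**1914 and **ALI**1903 1912',
]

# Replace every match with the literal word OLDER
results = [PATTERN.sub('OLDER', s) for s in samples]
for line in results:
    print(line)
```

In pandas the same pattern should work with `dataframe['Data'].str.replace(pattern, 'OLDER', regex=True)`. Passing `regex=True` explicitly matters on pandas 2.0 and later, where `str.replace` treats the pattern as a literal string by default.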

edited Aug 27 '19 at 19:34

answered Aug 27 '19 at 19:18


Erfan · Answer 2 · 2019-08-27T19:21:04.013

Use str.extractall and np.where with str.replace:

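import numpy as np

# Numbers that directly follow **ALI**, one column per match
nums = dataframe['Data'].str.extractall(r'(?<=\*\*ALI\*\*)(\d+)').astype(int).unstack()

# Row-level test: if any tagged number in a row is <= 1929, every tagged number in that row is replaced
dataframe['older'] = np.where(nums.le(1929).any(axis=1),
                              dataframe['Data'].str.replace(r'(?<=\*\*ALI\*\*)(\d+)', 'OLDER', regex=True),
                              dataframe['Data'])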

Output

                                            Data  ID                                           older
0  The **ALI**1929 for 90 days but not 77731929    1  The **ALI**OLDER for 90 days but not 77731929 
1               For all **ALI**1952  28A 177945    2                For all **ALI**1952  28A 177945 
2       But the **ALI**1914 and **ALI**1903 1912   3      But the **ALI**OLDER and **ALI**OLDER 1912
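
One thing that may be worth checking with this approach is that the decision is made per row, not per number. The sketch below is a plain-Python analogue of that row-level rule (not the pandas code itself). The function `replace_row` and the `mixed` string are hypothetical and do not appear in the question.

```python
import re

# Same lookbehind pattern as in the answer: digits directly after **ALI**
ALI_NUM = re.compile(r'(?<=\*\*ALI\*\*)(\d+)')

def replace_row(text, cutoff=1929):
    """Row-level rule: if ANY tagged number is <= cutoff, replace ALL of them."""
    nums = [int(n) for n in ALI_NUM.findall(text)]
    if any(n <= cutoff for n in nums):
        return ALI_NUM.sub('OLDER', text)
    return text

# Hypothetical row mixing an old and a newer number (not in the question's data)
mixed = 'Both **ALI**1914 and **ALI**1952'
print(replace_row(mixed))
```

Under this rule the 1952 in the mixed row is replaced as well, so the approach fits best when every row holds numbers on the same side of the cutoff, as in the sample data.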

This is really close, but it misses `1929`. - Aug 27 '19 at 19:18

Valdi_Bo · Answer 3 · 2019-08-27T19:46:54.513

As I see it, the regex should match **ALI**nnnn (nnnn - 4 digits) and:

The initial ** - should be deleted (always).
ALI** - should be left unchanged.
nnnn - should be optionally replaced with OLDER.

In this case, a complex regex is not necessary. The whole logic can be contained in a "replacement" function.

Define it as follows:

def repl(mtch):
    g1, g2 = mtch.group(1), mtch.group(2)
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

Then use str.replace with this function:

dataframe['Data'] = dataframe['Data'].str.replace(r'\*\*(ALI\*\*)(\d{4})(?!\d)', repl, regex=True)

Note that I also changed the regex, defining two capturing groups.
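
To see the replacement function in isolation, here is a small sketch using `re.sub` from the standard library. The input string is a hypothetical row that mixes an older and a newer number, and the names `pattern` and `out` are only for illustration.

```python
import re

def repl(mtch):
    # group(1) is 'ALI**', group(2) is the 4-digit number
    g1, g2 = mtch.group(1), mtch.group(2)
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

# The leading '**' sits outside both groups, so it is dropped from the result
pattern = r'\*\*(ALI\*\*)(\d{4})(?!\d)'

# Hypothetical input, not taken from the question's data
out = re.sub(pattern, repl, 'But the **ALI**1914 and **ALI**1952 1912')
print(out)
```

Because the comparison happens once per match, each number is judged on its own, and the untagged 1912 is left alone.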

Onyambu · Answer 4 · 2019-08-27T19:35:53.090

dataframe.Data.str.replace(r"(?<=\*ALI[*]{2})1[0-9](?:(?:[0-4][0-9])|5[0-1])\b", "OLDER", regex=True)
Out[364]: 
0    The **ALI**OLDER for 90 days but not 77731929 
1                  For all **ALI**1952  28A 177945 
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object

(?<=\*ALI[*]{2}) preceded by `*ALI**`
1[0-9] i.e. 10-19
(?: beginning of the outer non-capturing group
- (?:[0-4][0-9]) i.e. 00-49, not captured
- |5[0-1] i.e. 50-51
) end of the non-capturing group
\b word boundary
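
Reading the pieces together, the pattern appears to accept a first pair 10-19 followed by a second pair 00-51, which is not quite the same as "at or under 1929". A minimal sketch to probe that boundary follows. The helper `is_replaced` and the extra years 1940, 1951, and 1852 are hypothetical test values, not from the question.

```python
import re

# Pattern copied from the answer above
PATTERN = re.compile(r"(?<=\*ALI[*]{2})1[0-9](?:(?:[0-4][0-9])|5[0-1])\b")

def is_replaced(year):
    """True if '**ALI**<year>' would be rewritten by the pattern."""
    return PATTERN.search(f'**ALI**{year}') is not None

# Years from the question plus a few hypothetical ones around the boundaries
for year in (1903, 1929, 1940, 1951, 1952, 1852):
    print(year, is_replaced(year))
```

On the question's sample data the result is the desired one, but a year such as 1940 would also be replaced and a year such as 1852 would not, so the ranges may need adjusting for other data.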

score 0 · Answer 5 · answered Aug 27 '19 at 19:44

Define a custom repl callable and use it with str.replace:

repl = lambda m: m.group(1) if int(m.group(1)) > 1929 else 'OLDER'
dataframe.Data.str.replace(r'(?<=\*\*ALI\*\*)(\d+)', repl, regex=True)

Out[662]:
0    The **ALI**OLDER for 90 days but not 77731929
1                  For all **ALI**1952  28A 177945
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object
