Regex: replace all numbers and "number-like" strings except for years in range

Question

I have the following string:

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I want to replace with '' every part of this string which contains a number, except for those parts of the string that are in the year range 1950 to 2025. The resultant string would look like this (don't worry about the extraneous whitespace):

'2014          keep this text      2015 2025 '

So, effectively I want the brute-force removal of anything and everything remotely "numerical," except for something standalone (i.e. not part of another string, and of length 4 excluding whitespace) that resembles a year.

I know I can use this to remove everything containing digits:

re.sub(r'\w*[0-9]\w*', '', s)

But that doesn't return what I want:

'           keep this text        '

Here's my attempt at replacing anything that doesn't match the patterns listed below:

re.sub(r'^([A-Za-z]+|19[5-9]\d|20[0-1]\d|202[0-5])', '*', s)

Which returns:

'* 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I've been here and here, but wasn't able to find what I was looking for.

No, I'll eventually just strip those out. But that's an easy task. I'm more concerned about the number-like removal excluding years. — blacksite, Jun 05 '17 at 14:58

score 2 · Answer 1 · answered Jun 05 '17 at 15:09

Regex isn't good at working with numbers. I would ditch regex and use a generator expression:

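# Keep a word if it is an all-digit year in 1950..2025, or if it contains no digits at all.
predicate = lambda w: (w.isdigit() and 1950 <= int(w) <= 2025) or not any(char.isdigit() for char in w)
print(' '.join(w for w in s.split() if predicate(w)))
# 2014 keep this text 2015 2025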
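
One edge case the predicate above does not cover: the question asks for standalone tokens "of length 4", but a zero-padded token such as '02014' passes isdigit() and converts to 2014. A minimal sketch of a stricter variant, assuming such tokens should be dropped; the names keep_token and filter_years are illustrative and not part of the original answer:

```python
s = ('2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 '
     'keep this text n0t th1s th0ugh 1 0 2015 2025 2026')

def keep_token(word, first=1950, last=2025):
    """Return True for a 4-digit year in [first, last] or a digit-free word."""
    if word.isdigit():
        # Assumption: a year is exactly four digits, so '02014' is rejected.
        return len(word) == 4 and first <= int(word) <= last
    # Words mixing letters and digits (e.g. 'n0t') are dropped here.
    return not any(ch.isdigit() for ch in word)

def filter_years(text):
    """Join the whitespace-separated tokens that keep_token accepts."""
    return ' '.join(w for w in text.split() if keep_token(w))

print(filter_years(s))
```

Whether the length check is wanted depends on the data; without it the behaviour is the same as the one-liner above.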

Aran-Fey

Maybe... `(w.isdigit() and 1950<=int(w)<=2025) or w.isalpha()` ? – Jon Clements Jun 05 '17 at 15:13
@JonClements `isalpha()` isn't the same thing as not containing any digits. Any sort of punctuation or other special character would cause a word to be discarded. – Aran-Fey Jun 05 '17 at 15:16
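
The difference between the two checks discussed in these comments can be seen on a few sample words. A small sketch; the helper name has_no_digits and the sample words are made up for illustration:

```python
def has_no_digits(word):
    """True if no character in the word is a digit."""
    return not any(ch.isdigit() for ch in word)

samples = ["text", "don't", "well-known", "th1s"]
for word in samples:
    # isalpha() is False as soon as any non-letter appears, digit or not.
    print(f"{word!r}: isalpha={word.isalpha()}, no_digits={has_no_digits(word)}")
```

With isalpha(), "don't" and "well-known" would be discarded along with "th1s"; the digit check keeps them and drops only "th1s".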

score 1 · Answer 2 · answered Jun 05 '17 at 15:00

I would do it like this because it's readable and easy to fix or improve:

import re

' '.join(
    filter(
        # Keep all-digit years in 1950..2025 and purely alphabetic words.
        lambda word: (word.isdigit() and 1950 <= int(word) <= 2025)
                     or re.match(r'^[a-zA-Z]+$', word),
        s.split()
    )
)
# '2014 keep this text 2015 2025'

edited Jun 05 '17 at 15:04

answered Jun 05 '17 at 15:00

Fomalhaut

Nice, but what about years in the range 1950-1999? – blacksite Jun 05 '17 at 15:03

score 1 · Accepted Answer · answered Jun 05 '17 at 15:16

Short solution using re.findall() function:

import re

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'
# Collect standalone in-range years, purely alphabetic words, and the separators between words.
result = ''.join(re.findall(r'\b(19[5-9][0-9]|20[01][0-9]|202[0-5]|[a-z]+|[^0-9a-z]+)\b', s, re.I))

print(result)

The output:

2014           keep this text      2015 2025
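
For completeness, the replacement can also be written with re.sub(), as the question originally asked. This is a different technique from the findall() approach above: a negative lookahead protects standalone years, and every other word containing a digit is replaced with ''. A hedged sketch; the names YEAR, NUMBERLIKE and strip_numberlike are illustrative:

```python
import re

s = ('2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 '
     'keep this text n0t th1s th0ugh 1 0 2015 2025 2026')

# Years 1950-1999, 2000-2019 and 2020-2025.
YEAR = r'(?:19[5-9]\d|20[01]\d|202[0-5])'

# At a word boundary, refuse to match if a complete in-range year starts here;
# otherwise consume the whole word, provided it contains at least one digit.
NUMBERLIKE = re.compile(r'\b(?!' + YEAR + r'\b)\w*\d\w*')

def strip_numberlike(text):
    """Replace every digit-containing word except standalone in-range years."""
    return NUMBERLIKE.sub('', text)

result = strip_numberlike(s)
print(repr(result))
```

The whitespace left behind by the removed words can be collapsed afterwards with ' '.join(result.split()).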
