I have the following string:

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I want to replace with '' every part of this string which contains a number, except for those parts of the string that are in the year range 1950 to 2025. The resultant string would look like this (don't worry about the extraneous whitespace):

'2014          keep this text      2015 2025 '

So, effectively I want the brute-force removal of anything and everything remotely "numerical," except for something standalone (i.e. not part of another string, and of length 4 excluding whitespace) that resembles a year.

I know I can use this to remove everything containing digits:

re.sub(r'\w*[0-9]\w*', '', s)

But that doesn't return what I want:

'           keep this text        '

Here's my attempt at replacing anything that doesn't match the patterns listed below:

re.sub(r'^([A-Za-z]+|19[5-9]\d|20[0-1]\d|202[0-5])', '*', s)

Which returns:

'* 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I've been here and here, but wasn't able to find what I was looking for.

blacksite

3 Answers

2

Regex isn't good at working with numbers. I would ditch regex and use a generator expression:

predicate = lambda w: (w.isdigit() and 1950 <= int(w) <= 2025) or not any(char.isdigit() for char in w)
print(' '.join(w for w in s.split() if predicate(w)))
Aran-Fey
  • Maybe... `(w.isdigit() and 1950<=int(w)<=2025) or w.isalpha()` ? – Jon Clements Jun 05 '17 at 15:13
  • @JonClements `isalpha()` isn't the same thing as not containing any digits. Any sort of punctuation or other special character would cause a word to be discarded. – Aran-Fey Jun 05 '17 at 15:16
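To make the difference raised in the comments concrete, here is a minimal sketch comparing the two predicates on a string that contains punctuation. The function names and the sample string are made up for illustration and are not from the answer.

```python
def keep_no_digits(w):
    # Predicate from the answer: a year in 1950..2025, or a word with no digits at all
    return (w.isdigit() and 1950 <= int(w) <= 2025) or not any(c.isdigit() for c in w)

def keep_alpha_only(w):
    # Variant suggested in the comment: a year in 1950..2025, or a purely alphabetic word
    return (w.isdigit() and 1950 <= int(w) <= 2025) or w.isalpha()

# Hypothetical sample with punctuation attached to some words
sample = "it's 2014, not 1949 or n0t; keep-this text 2015"

no_digits = ' '.join(w for w in sample.split() if keep_no_digits(w))
alpha_only = ' '.join(w for w in sample.split() if keep_alpha_only(w))

print(no_digits)   # it's not or keep-this text 2015
print(alpha_only)  # not or text 2015
```

Note that both versions drop `2014,` because the trailing comma makes `isdigit()` false; if the input can contain punctuation next to years, it would need to be stripped first.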
1

I would do it like this because it's readable and easy to fix or improve:

import re

# Keep standalone years in 1950..2025 and purely alphabetic words
' '.join(
    filter(
        lambda word: (word.isdigit() and 1950 <= int(word) <= 2025)
                     or re.match(r'^[a-zA-Z]+$', word),
        s.split()
    )
)
# '2014 keep this text 2015 2025'
Fomalhaut
1

Short solution using re.findall() function:

import re

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'
result = ''.join(re.findall(r'\b(19[5-9][0-9]|20[01][0-9]|202[0-5]|[a-z]+|[^0-9a-z]+)\b', s, re.I))

print(result)

The output:

2014           keep this text      2015 2025 
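A possible variation, not part of the answer above: keep the asker's original `\w*[0-9]\w*` pattern and pass `re.sub()` a replacement function that puts a token back only when it is a standalone year. This is a sketch; the names `YEAR`, `TOKEN_WITH_DIGIT` and `strip_numeric` are made up for illustration.

```python
import re

# Four-digit years from 1950 to 2025 (same alternation as in the answer above)
YEAR = re.compile(r'19[5-9][0-9]|20[01][0-9]|202[0-5]')
# Any run of word characters that contains at least one digit
TOKEN_WITH_DIGIT = re.compile(r'\w*[0-9]\w*')

def strip_numeric(text):
    """Blank out every token containing a digit unless it is a year in range."""
    def repl(match):
        token = match.group()
        # fullmatch ensures the whole token is a year, not just a prefix of it
        return token if YEAR.fullmatch(token) else ''
    return TOKEN_WITH_DIGIT.sub(repl, text)

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'
result = strip_numeric(s)
print(repr(result))
```

Because only the matched tokens are replaced, the whitespace between them is left in place, as in the question's expected output.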
RomanPerekhrest