Use Regex re.sub to remove everything before and including a specified word

Question

I've got a string that looks like "Blah blah blah, Updated: Aug. 23, 2012", from which I want to use regex to extract just the date, Aug. 23, 2012. I found a similar question in the stacks, "regex to remove all text before a character", but that approach didn't work when I tried:

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^[^Updated]*',"", date_div)

How can I remove everything up to and including Updated, so that only Aug. 23, 2012 is left over?

Thanks!

score 15 · Accepted Answer · answered Jul 30 '14 at 19:36

In this case, you can do it without regex, e.g.:

>>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
>>> date_div.split('Updated: ')
['Blah blah blah, ', 'Aug. 23, 2012']
>>> date_div.split('Updated: ')[-1]
'Aug. 23, 2012'
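One detail worth knowing: split(...)[-1] keeps the text after the last occurrence of the marker. If you want the text after the first occurrence instead, str.partition is one option. A minimal sketch, where text_after is a hypothetical helper name and the second sample string is made up for illustration:

```python
date_div = "Blah blah blah, Updated: Aug. 23, 2012"

def text_after(text, marker):
    """Return the text after the first occurrence of marker (hypothetical helper)."""
    # str.partition returns (before, marker, after); the middle item is an
    # empty string when the marker does not occur in text.
    before, found, after = text.partition(marker)
    return after if found else text

print(text_after(date_div, "Updated: "))

# With two markers, partition cuts at the first one, split(...)[-1] at the last.
twice = "a Updated: b Updated: c"  # made-up sample
print(text_after(twice, "Updated: "))
print(twice.split("Updated: ")[-1])
```

If the marker is missing, both approaches return the original string unchanged.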

mkriheli

Works so well. Thanks :) – maudulus Jul 30 '14 at 19:38

score 7 · Answer 2 · answered Feb 13 '19 at 08:20

With a regex, you can use one of two patterns, depending on which occurrence of the word you want to cut at:

# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word

See the non-greedy regex demo and a greedy regex demo.

The ^ matches the start of the string. .*? matches any zero or more chars, as few as possible (.* matches as many as possible); mind the use of the re.DOTALL flag so that . can also match newlines. Then word matches and consumes the word (i.e. adds it to the match and advances the regex index).

Note the use of re.escape(up_to_word): if your up_to_word does not consist of sole alphanumeric and underscore chars, it is safer to use re.escape so that special chars like (, [, ?, etc. could not prevent the regex from finding a valid match.
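To illustrate why escaping matters, here is a small sketch with a made-up string whose marker contains parentheses (the sample text and the "(USD):" marker are assumptions for the example, not from the question):

```python
import re

text = "price (USD): 42"       # made-up sample
up_to_word = "(USD):"          # contains the regex metacharacters ( and )

# Unescaped, the parentheses form a capture group, so the pattern looks for
# the literal sequence "USD:", which does not occur in the text.
unescaped = re.sub(r'^.*?' + up_to_word, '', text, flags=re.DOTALL)

# Escaped, the parentheses are matched literally.
escaped = re.sub(r'^.*?' + re.escape(up_to_word), '', text, flags=re.DOTALL)

print(repr(unescaped))  # 'price (USD): 42' (nothing removed)
print(repr(escaped))    # ' 42'
```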

See the Python demo:

import re

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"

up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))

print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())

Output:

Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019

score 6 · Answer 3 · answered Jul 30 '14 at 19:33

You can use Lookahead:

import re
date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
print(extracted_date)

OUTPUT

Updated: Aug. 23, 2012

EDIT
If MattDMo's comment below is correct and you want to remove the "Updated: " as well, you can do:

extracted_date = re.sub('^(.*Updated: )',"", date_div)
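Side by side, the two variants differ only in whether the word is consumed by the match. A short sketch using the string from the question (the variable names kept and removed are just for illustration):

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# Lookahead: "Updated" is asserted but not consumed, so it stays in the result.
kept = re.sub(r'^.*(?=Updated)', '', date_div)

# Consuming match: "Updated: " is part of the match, so it is removed too.
removed = re.sub(r'^.*Updated: ', '', date_div)

print(kept)     # Updated: Aug. 23, 2012
print(removed)  # Aug. 23, 2012
```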

Nir Alfasi

I think OP wants to remove `Updated: ` as well – MattDMo Jul 30 '14 at 19:34
Works so well. Thanks :) – maudulus Jul 30 '14 at 19:39
