Regex - Target containing formatting of date - get all content in that

Question

I have a web page that contains a <td> tag, for example

<td>Aug 17, 2017 02:00 PM EDT</td>

I'm trying to use regex to find content in the page matching this format: a comma, then a space, four digits, a space, two digits, a colon, two digits, a space, two capital letters, a space, and three capital letters. This is just to make sure I always target that date and don't accidentally get something else.

I don't think another instance of that format would ever occur, but I'd want the first instance. I guess I could just grab the [0] position in the returned variable to be sure I get the correct date.

I have the following regex so far:

(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)

So, in python code:

date = re.findall(r'(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)', page)
print(date[0])

This gets me close, but not quite all the way. It gets me

, 2017 02:00 PM EDT

Whereas I need

Aug 17, 2017 02:00 PM EDT

But I can't figure out how to extend the regex to grab all of the td. Thanks for any help!

(btw, Python 3)

Edit: adding the decode step

page = response.read().decode('utf-8')

Not really a duplicate, but you're going to want to read this: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 — mypetlion, Jan 22 '18 at 17:50
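For what the linked question is getting at, here is a minimal sketch that lets the standard library's `html.parser` find the `<td>` cells and uses the regex only to check each cell's text. `TdCollector` and `first_date_cell` are hypothetical names, and the month and day parts of the pattern (three-letter month, one or two digit day) are assumptions based on the single example in the question.

```python
import re
from html.parser import HTMLParser

# Date shape from the question; the month/day prefix is an assumption.
DATE_RE = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [A-Z]{2} [A-Z]{3}")


class TdCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self.in_td:
            self.cells[-1] += data


def first_date_cell(page):
    """Return the first <td> text that is exactly a date, or None."""
    parser = TdCollector()
    parser.feed(page)
    parser.close()
    for cell in parser.cells:
        text = cell.strip()
        if DATE_RE.fullmatch(text):
            return text
    return None


page = "<table><tr><td>Title</td><td>Aug 17, 2017 02:00 PM EDT</td></tr></table>"
print(first_date_cell(page))
```

This is more code than a single regex, but it should keep working if the cell gains attributes or the markup around it changes.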

score 1 · Accepted Answer · answered Jan 22 '18 at 17:52

Place a regex group to match Aug 17, 2017 02:00 PM EDT between the td tags:

import re
s = "<td>Aug 17, 2017 02:00 PM EDT</td>"
# Raw string so that \s and \d reach the regex engine unchanged.
new_s = re.findall(r'<td>([a-zA-Z]+\s\d+,\s\d{4}\s[0-9\:]+\s[a-zA-Z\s]+)</td>', s)[0]

Output:

'Aug 17, 2017 02:00 PM EDT'
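If only the first occurrence is wanted, a slightly stricter variant could use `re.search`, which stops at the first match and returns `None` instead of raising `IndexError` when the page has no date. This is only a sketch: `first_date` is a hypothetical helper, and the three-letter month and one or two digit day are assumptions taken from the one example shown.

```python
import re

# Month/day part is an assumption; the rest follows the format in the question.
DATE_CELL = re.compile(
    r"<td>([A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [A-Z]{2} [A-Z]{3})</td>"
)


def first_date(page):
    """Return the first matching date string, or None if the page has none."""
    match = DATE_CELL.search(page)
    return match.group(1) if match else None


print(first_date("<td>Aug 17, 2017 02:00 PM EDT</td>"))
print(first_date("<td>no date here</td>"))
```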

Ajax1234


Great, thank you, very straightforward and works perfectly! For things like `<td>` I don't have to escape `<` like `\<`, right? – Kenny Jan 22 '18 at 18:03
@Kenny Glad to help! You are correct, `<` and `>` do not need to be escaped. – Ajax1234 Jan 22 '18 at 18:07
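A quick way to check the point about escaping: in Python's `re` module, a backslash before punctuation such as `<`, `>`, `,` or `:` just matches that character, so the escaped and unescaped patterns below behave the same. The shortened pattern is only for illustration.

```python
import re

s = "<td>Aug 17, 2017 02:00 PM EDT</td>"

# Both patterns match the same text; the backslashes in the second are redundant.
plain = re.search(r"<td>(.*?), (\d{4}) (\d{2}:\d{2})", s)
escaped = re.search(r"\<td\>(.*?)\, (\d{4}) (\d{2}\:\d{2})", s)

same = plain.groups() == escaped.groups()
print(plain.groups())
print(same)
```

Other regex flavors can differ (in some, `\<` is a word-boundary assertion), so this applies to Python specifically.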

score 1 · Answer 2 · answered Jan 22 '18 at 17:56

You forgot to grab all the content before the first comma.

<td>(?=.*\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)

Also, you have to put the opening <td> in the regex before your group, so the group won't capture it.

Regex101 test: https://regex101.com/r/yxqE6Q/1
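To see why the original pattern starts at the comma: a lookahead is zero-width, so it only pins down where the match may begin, and the first position where `(?=,\s\d{4}...)` succeeds is the comma itself. A small sketch comparing the two shapes (the `tail` variable is just a convenience introduced here, with the redundant backslashes dropped):

```python
import re

page = "<td>Aug 17, 2017 02:00 PM EDT</td>"
tail = r",\s\d{4}\s\d{2}:\d{2}\s[A-Z]{2}\s[A-Z]{3}"

# Original shape: the lookahead only succeeds at the comma, so (.*) starts there.
from_comma = re.findall(r"(?=" + tail + r")(.*)(?=</td)", page)

# Amended shape: consume <td> first, then look ahead across the cell for the tail.
whole_cell = re.findall(r"<td>(?=.*" + tail + r")(.*)(?=</td)", page)

print(from_comma[0])
print(whole_cell[0])
```

Note that `(.*)` is greedy, so if several cells share one line it would run up to the last `</td` on that line; a pattern with explicit character classes, like the one in the accepted answer, cannot cross a tag that way.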
