I have a web page that contains a <td> tag, for example:
<td>Aug 17, 2017 02:00 PM EDT</td>
I'm trying to use a regex to find content in the page matching this format: a comma, then a space, then four digits, then a space, then two digits, then a colon, then two digits, a space, two capital letters, a space, and three capital letters. That's just to make sure I always target that date and don't accidentally get something else.
I don't think another instance of that format would ever occur, but I'd want the first instance. I guess I could just grab the [0] position in the returned list to be sure I get the correct date.
I have the following regex so far:
(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)
So, in Python code:
import re
date = re.findall(r'(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)', page)
print(date[0])
This gets me close, but not quite all the way. It gets me
, 2017 02:00 PM EDT
Whereas I need
Aug 17, 2017 02:00 PM EDT
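The truncated result seems to come from the lookahead being zero-width: it only checks what follows and consumes nothing, so the match starts at the comma, where the lookahead first succeeds. A minimal sketch reproducing this, where the `page` string is a made-up stand-in for the real HTML:

```python
import re

# Hypothetical stand-in for the decoded page; the real HTML will differ.
page = "<tr><td>Aug 17, 2017 02:00 PM EDT</td></tr>"

# The regex from the question, unchanged.
pattern = r'(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)'
match = re.search(pattern, page)

# The lookahead consumes no characters, so the match begins at the comma
# and "Aug 17" is left outside the captured group.
first_char = page[match.start()]
captured = match.group(1)
print(repr(first_char))
print(captured)
```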
But I can't figure out how to extend the regex to grab the entire contents of the td. Thanks for any help!
(btw, Python 3)
Edit: adding the decode step
page = response.read().decode('utf-8')
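For illustration, one way the full date might be captured is to match the month and day as ordinary pattern text in front of the comma instead of using lookaheads, and to use re.search so that only the first instance is returned. This is only a sketch under assumptions: the sample `page` string and the name `DATE_RE` are made up, and it assumes the month is always a three-letter abbreviation followed by a one- or two-digit day.

```python
import re

# Hypothetical stand-in for the decoded page; the real HTML will differ.
page = "<table><tr><td>Name</td><td>Aug 17, 2017 02:00 PM EDT</td></tr></table>"

# Assumption: three-letter month abbreviation and a one- or two-digit day.
DATE_RE = re.compile(
    r"[A-Z][a-z]{2}\s\d{1,2}"  # month abbreviation and day, e.g. "Aug 17"
    r",\s\d{4}"                # comma, space, four digits (year)
    r"\s\d{2}:\d{2}"           # space, two digits, colon, two digits
    r"\s[A-Z]{2}"              # space, two capital letters (AM/PM)
    r"\s[A-Z]{3}"              # space, three capital letters (time zone)
)

# search() returns the first match only, or None if nothing matches.
match = DATE_RE.search(page)
date = match.group(0) if match else None
print(date)
```

Checking for None before indexing avoids the IndexError that date[0] would raise on an empty findall result.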