0

I have a web page that contains a <td> tag, for example

<td>Aug 17, 2017 02:00 PM EDT</td>

I'm trying to use a regex to find content in the page matching this format: a comma, then a space, four digits, a space, two digits, a colon, two digits, a space, two capital letters, a space, and three capital letters. That way I always target that date and don't accidentally get something else.

I don't think another instance of that format would ever occur, but I'd want the first instance. I guess I could just grab the [0] position in the returned variable to be sure I get the correct date.

I have the following regex so far:

(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)

So, in Python code:

date = re.findall(r'(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)', page)
print(date[0])

This gets me close, but not quite all the way. It gets me

, 2017 02:00 PM EDT

Whereas I need

Aug 17, 2017 02:00 PM EDT

But I can't figure out how to extend the regex to grab all of the td. Thanks for any help!

(btw, Python 3)

Edit: adding the decode step

page = response.read().decode('utf-8')
Kenny
    Not really a duplicate, but you're going to want to read this: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – mypetlion Jan 22 '18 at 17:50

2 Answers

1

Place a regex group to match Aug 17, 2017 02:00 PM EDT between the td tags:

import re
s = "<td>Aug 17, 2017 02:00 PM EDT</td>"
# Raw string, so \s and \d reach the regex engine without invalid-escape warnings
new_s = re.findall(r'<td>([a-zA-Z]+\s\d+,\s\d{4}\s[0-9:]+\s[a-zA-Z\s]+)</td>', s)[0]

Output:

'Aug 17, 2017 02:00 PM EDT'
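
Since only the first occurrence is needed, `re.search` (which stops at the first match) may be a better fit than indexing into `re.findall`. The following is only a minimal sketch: the `page` string and the helper name `first_date` are made up for illustration, and the pattern is narrowed to the exact layout described in the question.

```python
import re

# Hypothetical page content; in the question it comes from
# response.read().decode('utf-8').
page = "<tr><td>Report</td><td>Aug 17, 2017 02:00 PM EDT</td></tr>"

# Month name, day, comma, 4-digit year, HH:MM, two capitals, three capitals.
DATE_CELL = re.compile(
    r"<td>([A-Za-z]+\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[A-Z]{2}\s[A-Z]{3})</td>"
)

def first_date(html):
    """Return the first matching date string, or None if there is none."""
    match = DATE_CELL.search(html)
    return match.group(1) if match else None

print(first_date(page))
```

Returning `None` instead of indexing `[0]` avoids an `IndexError` when the page contains no such cell.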
Ajax1234
  • Great, thank you, very straightforward and works perfectly! For things like `<td>` I don't have to escape `<` like `\<`, right? – Kenny Jan 22 '18 at 18:03
  • @Kenny Glad to help! You are correct, `<` and `>` do not need to be escaped. – Ajax1234 Jan 22 '18 at 18:07
1

You forgot to grab all the content before the first comma.

<td>(?=.*\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)

Also, you have to put the opening <td> in the regex before your group, so the group won't capture it.

Regex101 test: https://regex101.com/r/yxqE6Q/1
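
For reference, the pattern above can be tried from Python roughly as sketched below, with the unnecessary backslashes dropped (`,`, `:`, `<` and `/` need no escaping). One caveat: the greedy `(.*)` can run past the first `</td>` when several cells share a line. The `tight` variant using `[^<]*` is an assumed tightening, not part of the original answer, and the sample strings are invented.

```python
import re

cell = "<td>Aug 17, 2017 02:00 PM EDT</td>"
row = "<td>Report</td><td>Aug 17, 2017 02:00 PM EDT</td><td>Final</td>"

# The answer's pattern: lookahead for the date shape, then a greedy capture.
lookahead = r"<td>(?=.*,\s\d{4}\s\d{2}:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=</td)"

# Assumed variant: [^<]* cannot cross a tag boundary, so the capture
# stays inside a single cell.
tight = r"<td>([^<]*,\s\d{4}\s\d{2}:\d{2}\s[A-Z]{2}\s[A-Z]{3})</td>"

single = re.findall(lookahead, cell)   # one cell: just the date
greedy = re.findall(lookahead, row)    # several cells: capture spans all of them
bounded = re.findall(tight, row)       # several cells: just the date

print(single)
print(greedy)
print(bounded)
```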