Unable to extract the time from tycho.usno.navy.mil/timer.html with regex

Question

I need to extract the EDT and MDT times from the US Naval Observatory Master Clock Time webpage at the URL mentioned above. I've been trying to extract them using re.findall, but without success. I am using the following regex: \d{2}\:\d{2}\:\d{2}\s(AM|PM)\s(MDT|PDT). The output contains only PM and MDT or PDT, not the time itself.

https://regex101.com is a great way to try out and test your regular expressions. You can put the test output from the web site as a "test string" and then work out different expressions and graphically see what's matching and what's not. — payne, Sep 19 '18 at 00:26

zwer · Answer 1 · 2018-09-19T00:38:54.997

First of all, that's an HTML page, and using regex on HTML (or any nested/hierarchical data) is a bad idea. That being said, given the relative simplicity of the page, we can let it slide in this instance, but keep in mind that this is not the recommended way of doing things.

Your issue is that re.findall() returns only the captured groups ((AM|PM) and (MDT|PDT)) if your pattern contains capturing groups. You can turn them into non-capturing groups to collect the whole pattern, i.e.:

matches = re.findall(r"\d{2}:\d{2}:\d{2}\s(?:AM|PM)\s(?:MDT|PDT)", your_data)

Or, alternatively, you can use re.finditer() and extract the matches:

matches = [x.group() for x in re.finditer(r"\d{2}:\d{2}:\d{2}\s(AM|PM)\s(MDT|PDT)", your_data)]
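To see the difference side by side, here is a minimal sketch that runs all three variants on the same input. The `sample` string is a hypothetical stand-in for the page text (the real page's markup and values may differ), and the variable names are only for illustration:

```python
import re

# Hypothetical sample text standing in for the downloaded page;
# the actual page content and layout may differ.
sample = (
    "Sep. 19, 12:38:54 AM UTC\n"
    "Sep. 18, 08:38:54 PM EDT\n"
    "Sep. 18, 06:38:54 PM MDT\n"
    "Sep. 18, 05:38:54 PM PDT\n"
)

# Same pattern twice: once with capturing groups, once with non-capturing ones.
capturing = r"\d{2}:\d{2}:\d{2}\s(AM|PM)\s(MDT|PDT)"
non_capturing = r"\d{2}:\d{2}:\d{2}\s(?:AM|PM)\s(?:MDT|PDT)"

# With capturing groups, findall() returns one tuple of groups per match.
groups_only = re.findall(capturing, sample)

# With non-capturing groups, findall() returns the whole matched text.
full_matches = re.findall(non_capturing, sample)

# finditer() yields match objects; group() with no argument is the whole match.
via_finditer = [m.group() for m in re.finditer(capturing, sample)]

print(groups_only)   # [('PM', 'MDT'), ('PM', 'PDT')]
print(full_matches)  # ['06:38:54 PM MDT', '05:38:54 PM PDT']
print(via_finditer)  # ['06:38:54 PM MDT', '05:38:54 PM PDT']
```

Note that the pattern from the question only lists MDT and PDT, so the EDT line is skipped. To pick it up as well, EDT would have to be added to the alternation, e.g. (?:EDT|MDT|PDT).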

It worked!! Thank you. Agreed that it ain't the best way. But sometimes you have restrictions. That's why I had to use re.findall(). — Shazam, Oct 13 '18 at 20:32
