RegEx: How can I match timecodes above a certain time?

Question

I'm writing a script to scour the metadata of YouTube videos and grab timecodes out of them, if any.

import json
import re
import urllib.request

with urllib.request.urlopen("https://www.googleapis.com/youtube/v3/videos?id=m65QTeKRWNg&key=AIzaSyDls3PGTAKqbr5CqSmxt71fzZTNHZCQzO8&part=snippet") as url:
    data = json.loads(url.read().decode())

description = json.dumps(data['items'][0]['snippet']['description'], indent=4, sort_keys=True)
print(description)

This works fine, so I go ahead and find the timecodes.

# finds timecodes like 00:00
timeBeforeHour = re.findall(r'[\d\.-]+:[\d.-]+', description)

>>['0:00', '6:00', '9:30', '14:55', '19:00', '23:23', '28:18', '33:33', '37:44', '40:04', '44:15', '48:00', '54:00', '58:18', '1:02', '1:06', '1:08', '1:12', '1:17', '1:20']

It also grabs the times after 59:00, but incorrectly: each match stops before the final ":" and loses the seconds. So I grab the remaining set separately:

# finds timecodes like 00:00:00
timePastHour = re.findall(r'[\d\.-]+:[\d.-]+:[\d\.-]+', description)

>>['1:02:40', '1:06:10', '1:08:15', '1:12:25', '1:17:08', '1:20:34']

I want to concatenate them, but I still have the incorrect times from the first regex. How can I stop the first regex from matching anything above an hour, i.e. 59:59?

I look at regex and my head explodes a bit, any clarification would be super!

edit:

I've tried this:

description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)

and this:

description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)

but I'm entering them wrong. What is each position of the regex doing?

Also for context, this is part of the sample I'm trying to strip:

 Naked\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase\n1:08:15 Epoch - Formula\n1:12:25 Perturbator - Night Business\n1:17:08 Murkula - Death Code\n1:20:34 LAZERPUNK - Revenge\n\nPhotography by Jezael Melgoza"

What are the contexts you want to match them in? Why do you have `-` and `.` in the regex? `(?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d)` should help. — Wiktor Stribiżew, Oct 10 '19 at 19:46
[Editing](https://stackoverflow.com/posts/58330071/edit) your question with the suggestions from both @WiktorStribiżew and MonkeyZeus will improve it. — Ross Jacobs, Oct 10 '19 at 19:53
@WiktorStribiżew the . and - were part of a guide I was reading for findall. I adapted it for timecode, I'm still not fully sure what they do. In context, this is the description: ```Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo\``` — Lukabratzee, Oct 10 '19 at 22:03
@Wiktor I entered your sample like this ```description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)``` but it looks like I've entered it wrong. — Lukabratzee, Oct 10 '19 at 22:13
Yes it is wrong, use `results = re.findall(r'(?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d)', description)` — Wiktor Stribiżew, Oct 10 '19 at 22:51

Wiktor Stribiżew · Accepted Answer · 2019-10-15T19:21:34.093

1

Use

results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)

See the regex demo.

It will match a time string when it is not inside a longer colon-separated digit string (like 11:22:22:33).

Explanation:

(?<!\d:) - a negative lookbehind that matches a location that is not immediately preceded with a digit and :
(?<!\d) - a negative lookbehind that matches a location that is not immediately preceded with a digit (a separate lookbehind is necessary because Python re lookbehind only accepts a fixed-width pattern)
[0-5]?\d - an optional digit from 0 to 5 and then any 1 digit
: - a colon
[0-5]\d - a digit from 0 to 5 and then any 1 digit
(?!:?\d) - a negative lookahead that matches a location that is not immediately followed with an optional : and a digit.
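
For readers who find the one-line pattern hard to parse, here is a minimal sketch of the same regex written with `re.VERBOSE`, so that each piece from the list above sits on its own commented line. The name `TIMECODE` and the choice of sample strings are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
# In verbose mode, whitespace and "#" comments inside the pattern are ignored.
TIMECODE = re.compile(r"""
    (?<!\d:)     # not immediately preceded by a digit and a colon
    (?<!\d)      # not immediately preceded by a digit
    [0-5]?\d     # minutes: optional digit 0-5, then any one digit
    :            # a literal colon
    [0-5]\d      # seconds: a digit 0-5, then any one digit
    (?!:?\d)     # not immediately followed by an optional colon and a digit
""", re.VERBOSE)

# Sample strings taken from the question and the comment thread.
for sample in ["9:30", "14:55", "1:02:40", "88:99", "11:22:22:33"]:
    print(sample, "->", TIMECODE.findall(sample))
# 9:30 -> ['9:30']
# 14:55 -> ['14:55']
# 1:02:40 -> []
# 88:99 -> []
# 11:22:22:33 -> []
```

The two lookbehinds are what keep `02:40` from being picked out of the middle of `1:02:40`, and the lookahead is what rejects `1:02` at its start.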

Python online demo:

import re
description = "Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
print(results) 
# => ['0:00', '6:00', '9:30', '14:55']
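
The question also wanted the under-an-hour and past-an-hour lists joined. One possible way to do that, offered here as an assumption rather than as part of the answer above, is to put an optional hours group in front of the same pattern so that a single `findall` call returns both kinds in document order. `ANY_TIMECODE` is a hypothetical name, and limiting hours to one or two digits is an assumption.

```python
import re

# Assumption: hours, when present, are one or two digits.
# (?:...) is non-capturing, so findall still returns whole matches.
ANY_TIMECODE = re.compile(r'(?<!\d:)(?<!\d)(?:\d{1,2}:)?[0-5]?\d:[0-5]\d(?!:?\d)')

# Lines borrowed from the sample description in the question.
description = (
    "0:00 Tonebox - Frozen Code\n"
    "14:55 Perturbator - Neo Tokyo\n"
    "1:02:40 Marvel 83' - Genesis\n"
    "1:06:10 Ward-Iz - The Chase"
)

timecodes = ANY_TIMECODE.findall(description)
print(timecodes)
# => ['0:00', '14:55', '1:02:40', '1:06:10']
```

Running the two original patterns separately and concatenating the lists would also work, but the results would then be grouped by format instead of following the order in the description.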

edited Oct 15 '19 at 19:21

answered Oct 12 '19 at 13:27

Wiktor Stribiżew


I attempted your suggestion. When I print results, I get [ ]. The array's empty for some reason. Curiously, using the regex that is meant to capture timecodes past one hour, it's now printing all times '00:53:44', '00:56:11', '00:58:45', '01:01:40', '01:04:42'] Upvoted because of that knowledgeable breakdown, thank you! – Lukabratzee Oct 15 '19 at 17:20
@Lukabratzee Please check [this Python demo](https://ideone.com/XDRlzC), it outputs `['0:00', '6:00', '9:30', '14:55']` when the input is `"Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"`. – Wiktor Stribiżew Oct 15 '19 at 19:20

That's perfect. It works as expected and is more comprehensive than my original attempt. Thank you for all your help. I've upvoted, but I need more reputation for it to actually show :p – Lukabratzee Oct 16 '19 at 11:02

MonkeyZeus · Answer 2 · 2019-10-10T20:05:39.083

0

I think this is what you are looking for:

(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)

https://regex101.com/r/yERoPi/1
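
A rough sketch of how this pattern behaves with `re.findall`, assuming Python's `re` module as used in the question: because the pattern has three capturing groups, `findall` returns a tuple per match, and the timecode is the middle element. The variable names and the sample string are illustrative.

```python
import re

# Group 1: start of string, or one character that is neither a digit nor ':'
# Group 2: the timecode itself (minutes 0-59, seconds 00-59)
# Group 3: one character that is neither a digit nor ':', or end of string
pattern = r'(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)'

description = (
    "0:00 Tonebox - Frozen Code\n"
    "14:55 Perturbator - Neo Tokyo\n"
    "1:02:40 Marvel 83' - Genesis"
)

matches = re.findall(pattern, description)
print(matches)
# => [('', '0:00', ' '), ('\n', '14:55', ' ')]

# Keep only the middle group, which holds the timecode.
timecodes = [m[1] for m in matches]
print(timecodes)
# => ['0:00', '14:55']
```

Unlike lookarounds, groups 1 and 3 consume the characters around each timecode, which is why they show up in the tuples.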

edited Oct 10 '19 at 20:05

answered Oct 10 '19 at 19:57

MonkeyZeus


`(^|[^\d:])(\d{1,2}:\d{2})([^\d:]|$)` [won't match](https://regex101.com/r/PmhDZ5/1) `23:45` in `23:45:Description` and will also match `88:99`. – Wiktor Stribiżew Oct 10 '19 at 19:59
@WiktorStribiżew I took care of `88:99` in my updated regex but what makes `23:45:description` a time code versus `Football team XYZ has a 3:45:loss streak` – MonkeyZeus Oct 10 '19 at 20:02
@MonkeyZeus I tried adding that in as ```description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)``` but I've not entered it correctly. Believe me when I say I have no clue what each position in the regex means. A breakdown of it would go a long way! – Lukabratzee Oct 10 '19 at 22:11
