3

I'm writing a script to scour the metadata of YouTube videos and grab timecodes out of them, if any.

import json
import re
import urllib.request

with urllib.request.urlopen("https://www.googleapis.com/youtube/v3/videos?id=m65QTeKRWNg&key=AIzaSyDls3PGTAKqbr5CqSmxt71fzZTNHZCQzO8&part=snippet") as url:
    data = json.loads(url.read().decode())

description = json.dumps(data['items'][0]['snippet']['description'], indent=4, sort_keys=True)
print(description)

This works fine, so I go ahead and find the timecodes.
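A side note on the snippet above, offered as a small sketch: `json.dumps` applied to a string returns its JSON-encoded form, so the text the regexes see is wrapped in double quotes and has newlines escaped as a literal backslash followed by `n`. The `raw` value below is a made-up stand-in for the API response, not real data.

```python
import json

# Hypothetical stand-in for data['items'][0]['snippet']['description'].
raw = "Tracks\n0:00 Tonebox - Frozen Code"

# indent and sort_keys have no effect on a plain string; the result is
# the JSON-encoded text, quoted and with the newline escaped.
encoded = json.dumps(raw, indent=4, sort_keys=True)

print(repr(raw))      # contains one real newline character
print(repr(encoded))  # contains a backslash and an 'n' instead
```

If the goal is only to search the text, using the string directly (without `json.dumps`) avoids the extra quoting.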

# finds timecodes like 00:00
timeBeforeHour = re.findall(r'[\d\.-]+:[\d.-]+', description)

>>['0:00', '6:00', '9:30', '14:55', '19:00', '23:23', '28:18', '33:33', '37:44', '40:04', '44:15', '48:00', '54:00', '58:18', '1:02', '1:06', '1:08', '1:12', '1:17', '1:20']

It also picks up the times past 59:59, but truncated: it stops before the final ":" and the seconds (for example '1:02' instead of '1:02:40'), so I grab the remaining set separately:

# finds timecodes like 00:00:00
timePastHour = re.findall(r'[\d\.-]+:[\d.-]+:[\d\.-]+', description)

>>['1:02:40', '1:06:10', '1:08:15', '1:12:25', '1:17:08', '1:20:34']

I want to concatenate the two lists, but I still have the truncated times from the first regex. How can I stop the first regex from matching anything above 59:59, i.e. timecodes of an hour or more?

I look at regex and my head explodes a bit, so any clarification would be super!

edit:

I've tried this:

description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)

and this:

description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)

but I'm entering them wrong. What is each position of the regex doing?

Also for context, this is part of the sample I'm trying to strip:

 Naked\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase\n1:08:15 Epoch - Formula\n1:12:25 Perturbator - Night Business\n1:17:08 Murkula - Death Code\n1:20:34 LAZERPUNK - Revenge\n\nPhotography by Jezael Melgoza"
Lukabratzee
  • What are the contexts you want to match them in? Why do you have `-` and `.` in the regex? `(?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d)` should help. – Wiktor Stribiżew Oct 10 '19 at 19:46
  • Can you provide samples of full descriptions? – MonkeyZeus Oct 10 '19 at 19:48
  • [Editing](https://stackoverflow.com/posts/58330071/edit) your question with the suggestions from both @WiktorStribiżew and MonkeyZeus will improve it. – Ross Jacobs Oct 10 '19 at 19:53
  • @WiktorStribiżew the . and - were part of a guide I was reading for findall. I adapted it for timecode, I'm still not fully sure what they do. In context, this is the description: ```Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo\``` – Lukabratzee Oct 10 '19 at 22:03
  • @Wiktor I entered your sample like this ```description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)``` but it looks like I've entered it wrong. – Lukabratzee Oct 10 '19 at 22:13
  • Yes it is wrong, use `results = re.findall(r'(?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d)', description)` – Wiktor Stribiżew Oct 10 '19 at 22:51

2 Answers

1

Use

results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)

See the regex demo.

It will match a time string when it is not inside a longer colon-separated digit string (like 11:22:22:33).

Explanation:

  • (?<!\d:) - a negative lookbehind that matches a location that is not immediately preceded with a digit and :
  • (?<!\d) - a negative lookbehind that matches a location that is not immediately preceded with a digit (a separate lookbehind is necessary because Python re lookbehind only accepts a fixed-width pattern)
  • [0-5]?\d - an optional digit from 0 to 5 and then any 1 digit
  • : - a colon
  • [0-5]\d - a digit from 0 to 5 and then any 1 digit
  • (?!:?\d) - a negative lookahead that matches a location that is not immediately followed with an optional : and a digit.
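The breakdown above can also be written into the pattern itself. This is a minimal sketch that assumes the same regex, compiled with `re.VERBOSE` so that whitespace is ignored and each part can carry a comment. The name `TIMECODE` and the sample text are illustrative only.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
TIMECODE = re.compile(r"""
    (?<!\d:)   # not preceded by digit + colon (rejects the tail of 1:02:40)
    (?<!\d)    # not preceded by a digit (rejects a start in mid-number)
    [0-5]?\d   # minutes: an optional 0-5, then any one digit
    :          # a literal colon
    [0-5]\d    # seconds: 00 to 59
    (?!:?\d)   # not followed by a digit or colon + digit (rejects the head of 1:02:40)
""", re.VERBOSE)

# Made-up sample in the same shape as the description in the question.
sample = "58:18 Example - Track\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase"

print(TIMECODE.findall(sample))
# => ['58:18']
```

The two hour-long timecodes are skipped entirely, rather than being cut down to '1:02' and '1:06'.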

Python online demo:

import re
description = "Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
print(results) 
# => ['0:00', '6:00', '9:30', '14:55']
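
Since the question also wants the hour-long timecodes in the same list, one possible extension is a single pattern with an optional hours part, so nothing has to be concatenated afterwards. This is a sketch rather than part of the demo above: the name `ANY_TIMECODE`, the unbounded `\d+` hours field, and the sample text are assumptions.

```python
import re

# Same lookarounds as above, plus an optional "hours:" prefix.
ANY_TIMECODE = re.compile(r'(?<!\d:)(?<!\d)(?:\d+:)?[0-5]?\d:[0-5]\d(?!:?\d)')

# Made-up sample mixing m:ss, mm:ss and h:mm:ss timecodes.
sample = "0:00 A\n14:55 B\n58:18 C\n1:02:40 D\n1:20:34 E"

print(ANY_TIMECODE.findall(sample))
# => ['0:00', '14:55', '58:18', '1:02:40', '1:20:34']
```

The matches come back in the order they appear in the text, so no sorting step is needed.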
Wiktor Stribiżew
  • I attempted your suggestion. When I print results, I get []. The list is empty for some reason. Curiously, the regex that is meant to capture timecodes past one hour is now printing all the times: '00:53:44', '00:56:11', '00:58:45', '01:01:40', '01:04:42'. Upvoted because of that knowledgeable breakdown, thank you! – Lukabratzee Oct 15 '19 at 17:20
  • @Lukabratzee Please check [this Python demo](https://ideone.com/XDRlzC), it outputs `['0:00', '6:00', '9:30', '14:55']` when the input is `"Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"`. – Wiktor Stribiżew Oct 15 '19 at 19:20
  • That's perfect. It works as expected and is more comprehensive than my original attempt. Thank you for all your help. I've upvoted but I needed more reputation for it to actually reflect :p – Lukabratzee Oct 16 '19 at 11:02
0

I think this is what you are looking for:

(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)

https://regex101.com/r/yERoPi/1
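
A note on using this pattern from Python, as a hedged sketch: because it contains three capturing groups, `re.findall` returns a 3-tuple per match rather than the timecode itself, so the second group has to be picked out. The helper name `extract_timecodes` and the sample text are hypothetical.

```python
import re

PATTERN = re.compile(r'(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)')

def extract_timecodes(text):
    """Return only group 2 (the timecode) of every match."""
    return [m.group(2) for m in PATTERN.finditer(text)]

# Made-up sample in the same shape as the description in the question.
sample = "Tracks\n0:00 A\n14:55 B\n1:02:40 C"

print(PATTERN.findall(sample))
# => [('\n', '0:00', ' '), ('\n', '14:55', ' ')]
print(extract_timecodes(sample))
# => ['0:00', '14:55']
```

Unlike the lookaround version, this pattern consumes the character on each side of the timecode, so two timecodes separated by a single character (for example "0:00 1:00") yield only the first one.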

MonkeyZeus
  • `(^|[^\d:])(\d{1,2}:\d{2})([^\d:]|$)` [won't match](https://regex101.com/r/PmhDZ5/1) `23:45` in `23:45:Description` and will also match `88:99`. – Wiktor Stribiżew Oct 10 '19 at 19:59
  • @WiktorStribiżew I took care of `88:99` in my updated regex but what makes `23:45:description` a time code versus `Football team XYZ has a 3:45:loss streak` – MonkeyZeus Oct 10 '19 at 20:02
  • @MonkeyZeus I tried adding that in as ```description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)``` but I've not entered it correctly. Believe me when I say I have no clue what each position in the regex means. A breakdown of it would go a long way! – Lukabratzee Oct 10 '19 at 22:11