1

I am working on a small project that extracts song and artist names from YouTube videos. Currently I have the video description, which has the following structure:

Text text text

Tracklist:
[00:00] Sobs - Girl
[02:25] Mopac - Cross-Eyed Dreaming
[05:54] L I P S - In Summer
[09:18] Small Wood House - T.V

Text text text

I want to extract the artist name and song name from this string. I am trying to do this with a regex, and the one I have right now matches the timecode and any text up to the next newline.

'((.*([0-9]?[0-9]:)?[0-5][0-9]:[0-5][0-9]).*\n)+'

Now I need a way to match the text before the timecodes, and the timecodes themselves, without including either in the final string. I tried to use capturing groups, but without success.

The result I want should look like this

Sobs - Girl
Mopac - Cross-Eyed Dreaming
L I P S - In Summer
Small Wood House - T.V

3 Answers

4

Use a regex that matches the timecode inside square brackets and captures only the text that follows it:

text = '''Tracklist:
[00:00] Sobs - Girl
[02:25] Mopac - Cross-Eyed Dreaming
[05:54] L I P S - In Summer
[09:18] Small Wood House - T.V'''

import re
re.findall(r'\[\d{2}:\d{2}\]\s*(.*)', text)

OUTPUT:

['Sobs - Girl',
 'Mopac - Cross-Eyed Dreaming',
 'L I P S - In Summer',
 'Small Wood House - T.V']
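
To make the role of the parentheses concrete, here is a minimal sketch comparing the same pattern with and without the capture group. The two-line `text` sample and the variable names `whole` and `captured` are only for illustration. When a pattern has no group, `re.findall` returns each whole match; when it has exactly one group, it returns only what that group matched.

```python
import re

# Shortened sample taken from the tracklist above.
text = "[00:00] Sobs - Girl\n[02:25] Mopac - Cross-Eyed Dreaming"

# No capture group: each result is the whole match, timecode included.
whole = re.findall(r'\[\d{2}:\d{2}\]\s*.*', text)

# One capture group: each result is only the text matched inside ( ).
captured = re.findall(r'\[\d{2}:\d{2}\]\s*(.*)', text)

print(whole)     # ['[00:00] Sobs - Girl', '[02:25] Mopac - Cross-Eyed Dreaming']
print(captured)  # ['Sobs - Girl', 'Mopac - Cross-Eyed Dreaming']
```

Because `.` does not match a newline by default, `(.*)` stops at the end of each line.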
ThePyGuy
  • Interesting. How does `re.findall()` return only the artist name and song name? – ARK1375 Jun 18 '21 at 16:24
  • I think it's due to the use of parentheses. Apparently regex only captures what is within `( )` – Cristian Trusin Jun 18 '21 at 16:27
  • Could you break down the regex pattern and explain what each element in it means? I have a hard time understanding regex. – Red Jun 18 '21 at 16:38
  • Always refer to the [documentation](https://docs.python.org/3/library/re.html) and understand the metacharacters; there is an [excellent thread for regex on Stack Overflow](https://stackoverflow.com/questions/4736/learning-regular-expressions) as well. As for the regex above, `\d` matches any digit from 0-9, `{2}` requires exactly two of them, `\s*` matches any number of whitespace characters (including none), `.*` matches any characters except a newline, and anything inside `(` `)` is treated as a group and captured. – ThePyGuy Jun 18 '21 at 16:43
  • Take a look here: https://regex101.com/r/6b44UY/1 This regex captures only the artist and song name because `(.*)` is inside parentheses, making it a capture group. – Cristian Trusin Jun 18 '21 at 16:44
1

You can match the time part using the relevant piece of the pattern you tried, and capture what comes after it in a capture group.

\[[0-5][0-9]:[0-5][0-9]]\s*(.+?\s+-\s+.+)

The pattern matches:

  • \[[0-5][0-9]:[0-5][0-9]] Match two two-digit numbers from 00 to 59, separated by a colon, between square brackets
  • \s* Match optional whitespace chars
  • (.+?\s+-\s+.+) Capture group 1, match the rest of the line, and make sure it contains a - surrounded by whitespace


Example code

import re

pattern = r"\[[0-5][0-9]:[0-5][0-9]]\s*(.+?\s+-\s+.+)"

s = ("Text text text\n\n"
    "Tracklist:\n"
    "[00:00] Sobs - Girl\n"
    "[02:25] Mopac - Cross-Eyed Dreaming\n"
    "[05:54] L I P S - In Summer\n"
    "[09:18] Small Wood House - T.V\n\n"
    "Text text text")

# findall returns only the contents of capture group 1 for each match
print(re.findall(pattern, s))

Output

[
'Sobs - Girl',
'Mopac - Cross-Eyed Dreaming',
'L I P S - In Summer',
'Small Wood House - T.V'
]

Or you can capture what is before and after the - in two separate capture groups, to get separate values for the artist name and the song.

\[[0-5][0-9]:[0-5][0-9]]\s*(.+?)\s+-\s+(.+)
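
As a rough sketch of how the two-group pattern behaves in Python: with more than one capture group, `re.findall` returns a list of tuples, one value per group. The sample string `s` is shortened from the question, and the last line (`Artist - Song - Remix`) is a made-up case added here only to show what the lazy `.+?` does when a title itself contains " - ".

```python
import re

pattern = r"\[[0-5][0-9]:[0-5][0-9]]\s*(.+?)\s+-\s+(.+)"

s = ("[00:00] Sobs - Girl\n"
     "[09:18] Small Wood House - T.V\n"
     "[12:00] Artist - Song - Remix\n")  # hypothetical line, not from the question

# Two capture groups: each result is an (artist, song) tuple.
pairs = re.findall(pattern, s)

for artist, song in pairs:
    print(f"artist={artist!r} song={song!r}")
# artist='Sobs' song='Girl'
# artist='Small Wood House' song='T.V'
# artist='Artist' song='Song - Remix'
```

Since the first group is lazy, the split happens at the first " - " on the line, so anything after it ends up in the song name.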


The fourth bird
0

Here is the regex I would use:

((\[\d+:\d+\])\s+(.*)\s+-\s+(.*))

Here is the example: https://regex101.com/r/6b44UY/1

This captures the whole line in an outer group and creates three inner groups: one for the time, one for the artist, and one for the title.

If you also want to capture whatever precedes the timecode "[00:00]", you can use:

((.*)(\[\d+:\d+\])\s+(.*)\s+-\s+(.*))

In Python, this would be:

import re

result = re.findall(r'((\[\d+:\d+\])\s+(.*)\s+-\s+(.*))', my_text)
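
For illustration, this is roughly what `result` looks like. The `my_text` value below is a shortened stand-in for the video description, and the `tracks` list is just one possible way to rebuild the "Artist - Title" strings the question asks for. Each item is a 4-tuple, because the outer parentheses count as a group too.

```python
import re

# Stand-in for the real video description (assumption for this sketch).
my_text = ("Tracklist:\n"
           "[00:00] Sobs - Girl\n"
           "[02:25] Mopac - Cross-Eyed Dreaming\n")

result = re.findall(r'((\[\d+:\d+\])\s+(.*)\s+-\s+(.*))', my_text)

# Each tuple is (whole match, timecode, artist, title).
print(result[0])  # ('[00:00] Sobs - Girl', '[00:00]', 'Sobs', 'Girl')

# Rebuild "Artist - Title" strings from the inner groups.
tracks = [f"{artist} - {title}" for _whole, _time, artist, title in result]
print(tracks)  # ['Sobs - Girl', 'Mopac - Cross-Eyed Dreaming']
```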
CyDevos
  • Thank you so much. This works like a charm and the regex test website will come in handy for future work. Cheers! – Cristian Trusin Jun 18 '21 at 16:16
  • There is just one small problem: some timecodes are in the format H:MM:SS, e.g. `[59:56] Tops - Petals [1:02:51] Wild Nothing - Live in Dreams [1:06:17] Women - Eyesore`. Those do not get captured. I do not really get your regex yet; could you give me a couple of hints on how to capture those as well? – Cristian Trusin Jun 18 '21 at 16:24
  • I did it: `r'((\[(?:\d+:)?\d+:\d+\])\s+(.*)\s+-\s+(.*))'` Thank you so much for your help!! – Cristian Trusin Jun 18 '21 at 16:29
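
A small sketch of the pattern from the last comment, run on the three tracks quoted above. It assumes each track sits on its own line in the real description (the comment shows them on one line only because of how comments are formatted). The optional non-capturing group `(?:\d+:)?` allows an hours part without adding another capture group, so the tuple layout stays the same.

```python
import re

pattern = r'((\[(?:\d+:)?\d+:\d+\])\s+(.*)\s+-\s+(.*))'

# Assumed layout: one track per line.
text = ("[59:56] Tops - Petals\n"
        "[1:02:51] Wild Nothing - Live in Dreams\n"
        "[1:06:17] Women - Eyesore\n")

# Drop the outer whole-line group and keep (timecode, artist, title).
rows = [(timecode, artist, title)
        for _whole, timecode, artist, title in re.findall(pattern, text)]

for timecode, artist, title in rows:
    print(timecode, artist, title, sep=" | ")
# [59:56] | Tops | Petals
# [1:02:51] | Wild Nothing | Live in Dreams
# [1:06:17] | Women | Eyesore
```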