
I'm trying to parse the tracklists in YouTube video descriptions and compile them into a .csv.

Currently I can isolate the timecodes, but isolating the song and artist is proving trickier.

First, I catch the whitespace

# catches whitespace
pattern = re.compile(r'\s+')

Second, the timecodes (to make the string simpler to deal with)

# catches timecodes
pattern1 = re.compile(r'[\d\.-]+:[\d.-]+:[\d\.-]+')

Then I use sub to remove both.

I then try to capture all strings between \n, as this is how the tracklist is formatted

songBeforeDash = re.search(r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*[\\n]*)+$', description)

The format follows \n[string]-[string]\n

Using this excellent visualiser, I've been able to tweak the pattern so it catches the first result, but subsequent results don't match. Is the search stopping at the first result and never looking for the others?

Here's a sample of what I'm trying to catch

\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n
Lukabratzee
  • some of the example string does not begin and end with \n – LinPy Oct 16 '19 at 11:26
  • My apologies! Work got in the way and I forgot to check back. The most appropriate answer was accepted. Thank you all for helping, I've made progress in my program from being able to parse what I needed :) – Lukabratzee Oct 18 '19 at 09:24

2 Answers


You can do that with split()

t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'

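# split on newlines; the leading and trailing '\n' each produce an empty string
liste = t.split('\n')
# drop the first and last (empty) elements
liste = liste[1:-1]
print(liste)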
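Since the end goal is a .csv, here is a minimal sketch of how the split lines could be turned into rows. It assumes the artist is everything before the first dash and the song is everything after it, which holds for the sample but may not hold for every description. The names `lines`, `rows` and `csv_text` are just for illustration.

```python
import csv
import io

t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\n'

# Keep only non-empty lines, so a missing leading or trailing '\n' does no harm.
lines = [line for line in t.split('\n') if line]

# Assumption: artist comes before the first dash, song after it.
rows = [line.split('-', 1) for line in lines]

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['artist', 'song'])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Splitting with a limit of 1 keeps any further dashes inside the song title rather than cutting it short.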
Alexall

re.search only returns the first match in the string. What you want is to use re.findall which returns all matches.
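To illustrate the difference, here is a small sketch. The pattern is not the one from the question; it is a hypothetical per-line pattern that assumes the description contains real newline characters and that each track line has the form `artist-song`.

```python
import re

description = '\nShopan-Woodnot\nYasper-MoveTogether\nauv-Rockaway5pm\n'

# Hypothetical pattern: with re.MULTILINE, ^ and $ match at every line
# boundary, so each line is tested on its own.
track = re.compile(r'^[^\n-]+-[^\n]+$', re.MULTILINE)

# search stops at the first match
first = track.search(description)
print(first.group(0))   # Shopan-Woodnot

# findall keeps scanning and returns every non-overlapping match
all_tracks = track.findall(description)
print(all_tracks)       # ['Shopan-Woodnot', 'Yasper-MoveTogether', 'auv-Rockaway5pm']
```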


EDIT - Consecutive matches would have to share the \n between them, and regex matches cannot overlap. I would suggest editing the regex so that each match stops before the next newline instead of consuming it. Consider changing the regex to this:

r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*)+$'

If what you want is for them to overlap (meaning you want to capture the newlines too), I suggest looking here to see how to capture overlapping regex patterns.

Also, as suggested by @ChatterOne, using the str.split(separator) method would work well here, assuming no other type of information is present.

description.split('\n')
anerisgreat
  • In this case it will not work anyway, because of the "greedy" behaviour of the regexp. And even if using a "lazy" behaviour (`.*?`) it will still not work because it will not use the ending `\n` as the starting point for a new match, so it will match one, skip one, match one, and so on. The best approach here is to use `split` as suggested in the other answer – ChatterOne Oct 16 '19 at 11:29
  • Agree, fixed comment, added credit where credit is due. – anerisgreat Oct 16 '19 at 11:35
  • `re.findall(r'(?<=\n)[^\n]+', description)`. – ekhumoro Oct 16 '19 at 12:07
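The "match one, skip one" behaviour described in the comments can be seen with a short sketch. The sample string is made up for brevity, and it assumes the description holds real newline characters. The first pattern consumes the newline on both sides of a line; the second is the lookbehind pattern from the last comment, which checks for the preceding newline without consuming it.

```python
import re

description = '\na-1\nb-2\nc-3\nd-4\n'

# Consuming both newlines: the '\n' that ends one match is no longer
# available to start the next one, so every other line is skipped.
consuming = re.findall(r'\n[^\n]+\n', description)
print(consuming)    # ['\na-1\n', '\nc-3\n']

# Lookbehind: the preceding '\n' is only asserted, not consumed,
# so every line is found.
lookbehind = re.findall(r'(?<=\n)[^\n]+', description)
print(lookbehind)   # ['a-1', 'b-2', 'c-3', 'd-4']
```

Note that the lookbehind version requires a newline before each line, so a first line at the very start of the string would not be matched.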