Capture string between \n [string] \n

Question

I'm trying to parse YouTube descriptions of songs to compile into a .csv file.

Currently I can isolate timecodes, but isolating the song and artist is proving trickier.

First, I catch the whitespace

# catches whitespace
pattern = re.compile(r'\s+')

Second, the timecodes (to make the string simpler to deal with)

# catches timecodes
pattern1 = re.compile(r'[\d\.-]+:[\d.-]+:[\d\.-]+')

Then I use sub to remove them.

I then try to capture all strings between \n, as this is how the tracklist is formatted

songBeforeDash = re.search(r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*[\\n]*)+$', description)

The format follows \n[string]-[string]\n

Using this excellent visualiser, I've been able to tweak it so it catches the first result, but any subsequent results don't match. Is this a case of stopping at the first result and not catching the others?

Here's a sample of what I'm trying to catch

\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n

My apologies! Work got in the way and I forgot to check back. The most appropriate answer was accepted. Thank you all for helping, I've made progress in my program from being able to parse what I needed :) — Lukabratzee, Oct 18 '19 at 09:24

score 3 · Accepted Answer · answered Oct 16 '19 at 11:26

You can do that with split()

t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'

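# Split on newlines; the leading and trailing '\n' each produce an empty string
liste = t.split('\n')
# Drop the empty first and last elements
liste = liste[1:-1]
print(liste)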
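
Since the goal is a .csv with song and artist, here is a minimal sketch of how the split approach could be extended. It assumes the artist comes before the first dash and the song after it, which is a guess based on the sample; the names `description`, `lines` and `tracks` are illustrative only. It also skips blank lines, which a real description may contain.

```python
# Shortened sample with an extra blank line, as a real description might have
description = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\n\nchromonicci-Memories.\n'

# splitlines() splits on newlines; the filter drops empty or whitespace-only lines
lines = [line.strip() for line in description.splitlines() if line.strip()]

tracks = []
for line in lines:
    # partition() splits on the first '-' only (assumed: artist-song)
    artist, dash, song = line.partition('-')
    if dash:  # ignore lines that contain no dash at all
        tracks.append((artist.strip(), song.strip()))

print(tracks)
```

Each tuple can then be passed to `csv.writer(...).writerow()` from the standard library.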

Alexall


`liste = t.strip().split('\n')` – ekhumoro Oct 16 '19 at 12:02
yeah it's better – Alexall Oct 16 '19 at 12:10
@Alexall I tried this suggestion and I'm given a blank array: [ ] – Lukabratzee Oct 16 '19 at 12:32
Is your variable 'description' a string? – Alexall Oct 16 '19 at 12:35
@Alexall it is string, yeah. In fact when I try and print now, I get a blank line. Going through the debugger, the values beforehand are printing as normal. When it gets to the suggested code, there's no data displayed. – Lukabratzee Oct 16 '19 at 13:25
I tried again by cutting a section of string out like in your code, and it works flawlessly! My variable's string data then is the issue https://pastebin.com/aFUKFBLg – Lukabratzee Oct 16 '19 at 13:30

anerisgreat · Answer 2 · 2019-10-16T11:31:59.177

2

re.search only returns the first match in the string. What you want is to use re.findall which returns all matches.
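
To illustrate the difference, here is a small sketch using a simplified pattern (not the one from the question): one or more characters that are neither a newline nor a dash, then a dash, then the rest of the line. The variable names are illustrative.

```python
import re

description = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\n'

# Simplified pattern, assumed for this example: text, a dash, then the rest of the line
track_pattern = re.compile(r'[^\n-]+-[^\n]+')

# search() stops at the first match and returns a match object (or None)
first = track_pattern.search(description)
print(first.group())

# findall() scans the whole string and returns every non-overlapping match
all_tracks = track_pattern.findall(description)
print(all_tracks)
```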

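EDIT - Because each newline would have to serve as both the end of one match and the start of the next, your matches would overlap, and regex matches cannot overlap. I would suggest editing the regex so that it captures only up to (not including) the next newline. Consider changing the regex to this: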

r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*)+$'

If what you want is for them to overlap (meaning you want to capture the newlines too), I suggest looking here to see how to capture overlapping regex patterns.
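
The sketch below shows the overlap problem on part of the sample, using simplified patterns that are assumptions for illustration rather than the patterns above. A pattern that consumes the trailing newline skips every other line, while a lookahead checks for the newline without consuming it. It also shows that in a raw string, `[\\n]` is a character class matching a backslash or the letter n, not a newline.

```python
import re

sample = '\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'

# Consumes the closing '\n', so the next match cannot start on that same newline
consuming = re.findall(r'\n([^\n]+)\n', sample)
print(consuming)  # only every other line is captured

# (?=\n) asserts a newline follows but leaves it available for the next match
lookahead = re.findall(r'\n([^\n]+)(?=\n)', sample)
print(lookahead)  # every line is captured

# r'[\\n]' matches a literal backslash or 'n'; it does not match a newline
matches_newline = re.fullmatch(r'[\\n]', '\n') is not None
matches_letter_n = re.fullmatch(r'[\\n]', 'n') is not None
print(matches_newline, matches_letter_n)
```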

Also, as suggested by @ChatterOne, using the str.split(separator) method would work well here, assuming no other type of information is present.

description.split('\n')

edited Oct 16 '19 at 11:31

answered Oct 16 '19 at 11:25

In this case it will not work anyway, because of the "greedy" behaviour of the regexp. And even if using a "lazy" behaviour (`.*?`) it will still not work because it will not use the ending `\n` as the starting point for a new match, so it will match one, skip one, match one, and so on. The best approach here is to use `split` as suggested in the other answer – ChatterOne Oct 16 '19 at 11:29
Agree, fixed comment, added credit where credit is due. – anerisgreat Oct 16 '19 at 11:35
`re.findall(r'(?<=\n)[^\n]+', description)`. – ekhumoro Oct 16 '19 at 12:07
