Matching only "featured artist" in a set of filenames -- current regex too greedy

Question

I'm writing a script in python to extract the name of the featured artist from an mp3 filename and set the appropriate id3v2 tag of the file. The filenames are in 3 different formats:

Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3

This is the regex that I wrote:

r'ft\. (.+)[.\-)]'

Then I can use re.findall to get the contents of the group. But this is what I get:

In [40]: r = r'ft\. (.+)[.\-)]'

In [47]: re.findall(r, 'Artist - Track ft. FeatArtist.mp3')
Out[47]: ['FeatArtist']

In [48]: re.findall(r, 'Artist ft. FeatArtist - Track.mp3')
Out[48]: ['FeatArtist - Track']

In [49]: re.findall(r, 'Artist - Track (ft. FeatArtist).mp3')
Out[49]: ['FeatArtist)']

My intended output is in all three cases precisely:

FeatArtist

The problem is that the regex is matching as much as it can - I want it to stop as soon as it finds one of the characters in [.\-)]. How can I do this?

Huh? A character class only matches one character. "greedy" and "non-greedy" doesn't really make sense in that context. — Charles Duffy, May 08 '17 at 16:46
I'm actually very sympathetic to the argument for closing the question (or answering it literally but unhelpfully) until it's updated to be entirely explicit. My answer addresses the *literal* problem the OP gave, eliminating greediness; they didn't provide sample output that says that they want `Feat.Artist` instead of `Feat`, and they *did* say that they wanted a non-greedy match, so `Feat` is exactly what substituting a non-greedy match for a greedy one provides. They want something other than that? **It's on them to ask a better question.** — Charles Duffy, May 08 '17 at 17:51
@CharlesDuffy In a court room you'd win that argument hands down. Two admins with well over 100k rep I wouldn't even get a space in the parking lot. But it's very clear (or at the very least, a reasonable best case based on the information) from the sample output and the clear specification of 3 types of input files, and the purpose it's being used for, that `Feat` wouldn't cut it. — hmedia1, May 08 '17 at 17:56
Sure. But StackOverflow's purpose from founding hasn't been to help the person asking a question one-on-one -- it is and has always been to build a long-tail knowledge base; helping people one-on-one is something we do *towards that larger goal*. This means that questions should be written to help not just the person who first asked them, but other folks who find them on Google, via search, etc.; if someone else ends up here via Google trying to find out how to make a search non-greedy, an answer that goes off on a tangent is unlikely to help. — Charles Duffy, May 08 '17 at 18:04
So, how do we square the conflict? We make the OP improve their question, to explicitly ask *exactly what they want to know* and to express that in the title, such that someone else who finds this in search results will know what they're getting from title and extracted summary. — Charles Duffy, May 08 '17 at 18:04
Agreed. In the meantime the question is now locked. I tried providing a detailed answer, and the post button was greyed out by the time I'd finished. I tried suggesting an edit, but the edit button was greyed out. I agree the OP question could have been better, and could have communicated and clarified more throughout. However, it was the crystal clear 3 types of input format, and the purpose of it that really made it quite straightforward. Not a lot of options to help in any of those goals when there's a thread that now remains active but is less helpful than what it could have been. — hmedia1, May 08 '17 at 18:15
Hmm. This is only showing as closed, not deleted, so the question *should* still be amenable to edits. Maybe there's a rep level required for folks other than the OP to edit a closed question? Gotta admit that it's a bit easy to lose track of who has permissions to do what. — Charles Duffy, May 08 '17 at 21:08
@hmedia1, I just edited the question for intent and reopened it -- feel free to add an answer. — Charles Duffy, May 08 '17 at 21:13
@CharlesDuffy Appreciate it. Answer posted. I'm a bit late to the party now, but like you said, the goal here is to build a knowledge base, so it's good to see two helpful answers and no intrusive warnings or downvotes. PS: I liked how you pasted the python console for unit tests. Will start doing this kind of thing myself. — hmedia1, May 10 '17 at 02:09

hmedia1 · Accepted Answer · 2017-05-10T06:02:30.150

For Python

For your specific requirement according to your filename formats:

re.findall(r'ft\.\s*(\w*)',filename)

Each of these filenames:

Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3

Will return:

```
['FeatArtist']
```
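As a quick check, the pattern can be run over the three sample filenames from the question. This is only a minimal sketch; the names `filenames`, `pattern` and `results` are chosen here for illustration.

```python
import re

# The three filename layouts from the question.
filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

# \w* stops at the first non-word character (space, dot or bracket),
# so nothing after the artist name is captured.
pattern = re.compile(r'ft\.\s*(\w*)')

results = [pattern.findall(name) for name in filenames]
for name, found in zip(filenames, results):
    print(name, '->', found)
```

Because `\w` does not match spaces or dots, this only works while the featured artist is a single word.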

If you want to account for a number of other possible scenarios:

In your provided examples, each FeatArtist terminates with one of the following: a space followed by a -, a round close bracket, or the file extension .mp3

If we had any of the following:

Feat.Artist
Feat Artist
Feat Middlename Artist
Feat Artist One & Artist Two

Things might fall apart. One way to tackle the above variants might be:

First get rid of the file extension without using string matching at all. Doing this with filenames gives you a cleaner starting point:

Using os.path.splitext('Artist - Track ft. FeatArtist.mp3')[0] you can get your filenames in this format: Artist - Track ft. FeatArtist
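A small sketch of that step on its own, assuming the file really does end in an extension (the variable names are illustrative):

```python
import os.path

filename = 'Artist - Track ft. FeatArtist.mp3'

# splitext splits at the last dot and returns a (root, extension) pair.
stem, ext = os.path.splitext(filename)

print(stem)
print(ext)
```

Note that splitext only looks at the last dot. If it is applied to a name that has already lost its extension, the dot in "ft." would be treated as the start of an extension, so it is probably safest to call it exactly once on the original filename.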

We can accommodate the new filenames with this regex:

re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)', filename)

Unit Tests: (Listed respectively for easier reading):

>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track ft. FeatArtist')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. FeatArtist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. FeatArtist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist & Other Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat Artist & Other Artist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat.Artist & Crew - Track')

Results:

['FeatArtist']
['FeatArtist']
['FeatArtist']
['Feat Artist']
['Feat Artist & Other Artist']
['Feat Artist & Other Artist']
['Feat.Artist & Crew']
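Putting the two steps together, one possible shape for a helper is sketched below. The function name `featured_artist` and the constant `FEAT_RE` are hypothetical, and returning None when there is no "ft." part is an assumption about what the calling script would want.

```python
import os.path
import re

# Lazy capture after "ft.", ending just before " -", ")" or end of string.
FEAT_RE = re.compile(r'ft\.\s*(\w*.*?)(?= -|\)|$)')


def featured_artist(filename):
    """Return the featured artist from an mp3 filename, or None if absent."""
    stem = os.path.splitext(filename)[0]  # drop the ".mp3" extension first
    match = FEAT_RE.search(stem)
    return match.group(1) if match else None


for name in ('Artist - Track ft. FeatArtist.mp3',
             'Artist ft. Feat.Artist & Crew - Track.mp3',
             'Artist - Track.mp3'):
    print(name, '->', featured_artist(name))
```

Using re.search rather than re.findall here is a design choice: the script needs a single value for the tag, so a single match object is easier to handle than a list.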

Why no lookbehind?

From the Python documentation (formatting added):

re.findall(pattern, string, flags=0) Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Therefore you can still use repetition operators to establish the match, and use groups to control the portion of the match returned.
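To illustrate the difference a group makes, here is a minimal comparison of the same pattern with and without a capturing group (the variable names are illustrative):

```python
import re

text = 'Artist ft. FeatArtist - Track'

# No group: findall returns the whole match, including the "ft. " prefix.
whole = re.findall(r'ft\.\s*\w*', text)

# One group: findall returns only what the group captured.
grouped = re.findall(r'ft\.\s*(\w*)', text)

print(whole)
print(grouped)
```

The "ft. " prefix still has to match in both cases; the group only changes which part of the match is handed back.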

Other ways to do something similar:

If you are using a regex engine that supports \K (which resets the start of the reported match, rather than being a back reference), then the match is everything after the \K:

Examples using grep with -P (Perl Regex) and -o (Only return match):

echo "Artist - Track ft. FeatArtist" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. FeatArtist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist - Track (ft. FeatArtist)" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. Feat Artist & Other Artist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
Feat Artist & Other Artist

score 0 · Answer 2 · answered May 09 '17 at 15:04

This should work:

(?<=ft\. )[^\-)\. ]+

(?<=ft\. ) is a lookbehind: it looks for a string that has "ft. " immediately before it.

[^\-)\. ]+ the string has to be a single word, with no spaces, dashes, brackets or dots.
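A short sketch of how this pattern behaves in Python on the filenames from the question (variable names are illustrative). Since the pattern has no capturing group, re.findall returns the whole match, and the lookbehind keeps "ft. " out of it.

```python
import re

filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

# The lookbehind asserts that "ft. " precedes the match without consuming it.
pattern = re.compile(r'(?<=ft\. )[^\-)\. ]+')

results = [pattern.findall(name) for name in filenames]
for name, found in zip(filenames, results):
    print(name, '->', found)
```

Python's re module only accepts fixed-width lookbehinds, so the lookbehind has to spell out exactly one space; something like `\s*` would not be allowed inside it.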

Xyzk

It should be ` [^-).]+`, instead of ` [^-). ]+` I think, as there might be cases where the ft. artist's name will have spaces. Thanks a lot for the answer :) – sudormrfbin May 09 '17 at 15:17
@Gokul yeah. That being said, if I were you I would check the string programmatically and remove any whitespace at the end, as the regex without the space can produce a string like `artist `, which when you compare to `artist` (no space at the end) will return false. – Xyzk May 09 '17 at 16:40
I accepted hmedia1 answer, because it was a bit more extensive and might be more useful to someone else looking for a solution. But your answer works just fine; thanks anyways – sudormrfbin May 10 '17 at 15:36