-1

I'd like to extract MP3 URLs from a page source, but only those that do not contain a specific word.

Here is the regular expression that I am using to search for mp3 urls:

https?:\/\/.+\.mp3

It works okay. Now I want to exclude the URLs that contain a specific word, so that only URLs without that word are matched.

How can I exclude a word between http and .mp3?

I will use it in Qt with C++, but as long as it works with https://regex101.com/ it is fine.

NESHOM
  • Possible duplicate of [Regular expressions: Ensuring b doesn't come between a and c](https://stackoverflow.com/questions/37240408/regular-expressions-ensuring-b-doesnt-come-between-a-and-c) – CertainPerformance Jan 26 '19 at 03:37
  • @CertainPerformance - No, that is different. If you read the description, it says `contains 123 somewhere in the middle`. However, I want the expression NOT to contain a word. – NESHOM Jan 27 '19 at 02:56
  • It's exactly the same - see the last part of the question, `and there are no other instances of abc or xyz in the substring besides the start and the end.` - just like the top answer prevents `abc` from occurring in the middle of the match, you just need to apply the same logic to your pattern. – CertainPerformance Jan 27 '19 at 03:48

3 Answers

3

If you want to "exclude those urls that do not have a specific word in them", you can use a positive lookahead for the word (with some number of characters before it) e.g.

(?=.*Sing)

In JavaScript:

const word = 'Sing';
const urls = ['http://I_like_to_sing.mp3', 'http://Another_song.mp3'];
// Backslashes must be doubled inside a string literal so the regex receives \. (a literal dot)
const regex = new RegExp('https?://(?=.*' + word + ').+\\.mp3', 'i');
console.log(urls.filter(v => v.match(regex)));

In PHP:

$word = 'Sing';
$urls = ['http://I_like_to_sing.mp3', 'http://Another_song.mp3'];
$regex = "/https?:\/\/(?=.*$word).+\.mp3/i";
print_r(array_filter($urls, function ($v) use ($regex) { return preg_match($regex, $v); }));

Output:

Array ( 
    [0] => http://I_like_to_sing.mp3 
)

Demo on 3v4l.org

Update

To exclude those URLs that do have a specific word in them, you can use a negative lookahead instead e.g.

(?![^.]*Sing)

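We use [^.] so that the lookahead stops at the first dot and cannot run past the .mp3 part. As a consequence, it only checks the text before the first dot in the URL. Here's a PHP demo: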

$word = 'Song';
$string = "some words http://I_like_to_sing.mp3 and then some other words http://Another_song.mp3 and some words at the end...";
$regex = "/(https?:\/\/(?![^.]*$word).+?\.mp3)/i";
preg_match_all($regex, $string, $matches);
print_r($matches[1]);

Output:

Array ( 
    [0] => http://I_like_to_sing.mp3
)

Demo on 3v4l.org
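
For URLs that contain several dots (for example a host name), a variation on the negative lookahead can check every character up to .mp3 instead of stopping at the first dot. The following is only a minimal Python sketch of that idea, not part of the original answer: the function name find_mp3_urls and the example.com URL are assumptions made for illustration, and the pattern swaps [^.]* for a "tempered" token that tests the lookahead at each position.

```python
import re

def find_mp3_urls(text, banned_word):
    """Return the mp3 URLs in text that do not contain banned_word (case-insensitive)."""
    # (?:(?!word)\S)+? consumes one non-space character at a time; the
    # negative lookahead rejects any position where the banned word starts.
    pattern = re.compile(
        r"https?://(?:(?!" + re.escape(banned_word) + r")\S)+?\.mp3",
        re.IGNORECASE,
    )
    return pattern.findall(text)

# Hypothetical sample text, modelled on the PHP demo above.
text = ("some words http://I_like_to_sing.mp3 and then some other words "
        "http://www.example.com/Another_song.mp3 and some words at the end...")
print(find_mp3_urls(text, "song"))
```

Because the pattern is built from \S, a match cannot spill over whitespace into a neighbouring URL, and re.escape keeps any regex metacharacters in the word from being interpreted.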

Nick
  • Sorry, there was a mistake in my question, I fixed it. – NESHOM Jan 27 '19 at 02:56
  • @NESHOM you shouldn't mark this accepted, it doesn't answer your actual question. I had been meaning to revisit the question though and I've made an edit which I think will solve your problem. – Nick Feb 01 '19 at 02:44
  • You are right. It helped a bit, but it did not answer the question directly. So, please post your updated answer. Thanks. – NESHOM Feb 01 '19 at 03:21
0

I hope this can be a useful answer.

This is a regular expression with a use case in Python 3. If you want to remove a "word" between http and .mp3, you can do this:

import re

ref = "http://www.some_undesired_text_018/m102/1-225x338.mp3"

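# Capture everything between the scheme (http or https) and .mp3
_del = re.findall(r'https?(.+)\.mp3', ref)[0]

# Remove the captured text from the original string
out = ref.replace(_del, "")

# _del now holds the undesired text (everything between http and .mp3); out is "http.mp3"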
Franco Gil
0

A minor edit to Nick's answer. You can exclude the word by negating the value returned from the match in the filter function like so:

urls.filter(v => !v.match(regex));

This works and is much simpler than the other solution further down, which gives an unexpected result.

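const word = 'Sing';
const urls = ['http://I_like_to_sing.mp3', 'http://Another_song.mp3'];
// Backslashes must be doubled inside a string literal so the regex receives \. (a literal dot)
const regex = new RegExp('https?://(?=.*' + word + ').+\\.mp3', 'i');
// Keep only the URLs that do NOT match the "contains the word" pattern
console.log(urls.filter(v => !v.match(regex)));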
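
The same two-step idea (match every mp3 URL first, then drop the ones containing the word) can also be applied to raw page source. Below is a rough Python sketch under that assumption; the names MP3_URL and mp3_urls_without and the sample text are hypothetical, and a plain substring test replaces the negated regex match used above.

```python
import re

# Matches any http(s) URL ending in .mp3; \S keeps each match within one token.
MP3_URL = re.compile(r"https?://\S+?\.mp3", re.IGNORECASE)

def mp3_urls_without(text, word):
    """Find all mp3 URLs in text, then keep those that do not contain word."""
    urls = MP3_URL.findall(text)
    # Case-insensitive substring check, done outside the regex.
    return [u for u in urls if word.lower() not in u.lower()]

# Hypothetical sample input.
page = "a http://I_like_to_sing.mp3 b http://Another_song.mp3 c"
print(mp3_urls_without(page, "Sing"))
```

Separating extraction from filtering keeps the regex simple, at the cost of a second pass over the matches.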
SCouto
ecalogero