Regex only for specific domain name in URL

Question

As much as I've tried I can't seem to find the correct regex to locate what I'm after here.

I only want to select the first instance of the url that matches the domain www.myweb.com from the following...

Some text https://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr

I need to completely ignore the second url www.adifferentsite.com and only work with the first one that matches www.myweb.com, ignoring any other possible instances of www.myweb.com

Once the first matching domain is discovered I need to store the rest of the url that comes after it...

page/cat/323123442321-rghe432

...into a new variable $newvar, so...

$newvar = 'page/cat/323123442321-rghe432';

I'm trying :

return preg_replace_callback( '/http://www.myweb.com/\/[0-9a-zA-Z]+/', array( __CLASS__, 'my_callback' ), $newvar );

I've read tons of documents on how to detect URLs but can't find anything about detecting a specific URL.

I really can't grasp how to formulate a regex, so I know the pattern above is incorrect. Any help would be greatly appreciated.

EDIT: Edited the question to be a bit more specific and hopefully a bit easier to resolve.

If you need to match, why replace? And when you created the regex, did you pay attention to the regex delimiters? I guess you get an "Unknown modifier" or delimiter error. — Wiktor Stribiżew, Dec 02 '15 at 12:20
I'm replacing because I'm going to be formulating the link into an oEmbed link where the url format is different. so https://www.myweb.com/page/cat/323123442321-rghe432 will become https://embed.myweb.com/page/cat/323123442321-rghe432 when rendered after this filter — Grant, Dec 02 '15 at 12:22
Regex wise, I just don't understand how it works. I've read and read and read and just can't seem to grasp the correct way of doing it. — Grant, Dec 02 '15 at 12:25
Ok, I think you can just use [`'~\bhttps?://www\.myweb\.com/(\S+)~'`](https://regex101.com/r/tZ7qV9/1) regex and push the `$m[1]` into the array for "the rest of URL". — Wiktor Stribiżew, Dec 02 '15 at 12:25
When you say push the $m[1] into the array for the rest of the url, could you explain that a bit more please? I understand what you are saying but I don't know how I would do that? — Grant, Dec 02 '15 at 12:27
Your example shows you want to capture the `-`, but you didn't include it in your character class. Use `[0-9a-zA-Z\-]+`. What @stribizhev is referring to is the `/` you use to wrap your regex. Since the pattern itself contains `/` (in `http://`), you need a different delimiter: change the wrapping `/` to `~` or `@`. [The PHP docs talk about it](http://php.net/manual/en/regexp.reference.delimiters.php) — , Dec 02 '15 at 12:36
Thanks @stribizhev that should be enough to get me going in the right direction — Grant, Dec 02 '15 at 12:46
@stribizhev I'm trying to alter your code so it doesn't store the results as an array. I only need to return the first matching instance so I don't need to use an array here. I'm not managing to make it work without using an array though. Is using an array the only way to do this? — Grant, Dec 02 '15 at 13:12
@Grant: I posted an improved demo in my answer with some explanations. — Wiktor Stribiżew, Dec 02 '15 at 13:13
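
The delimiter and character-class points from the comments above can be sketched as follows. This is only an illustration, not code from the thread: it uses `preg_match` with `~` as the delimiter, and the character class (extended here with `/` and `-` so the whole path is captured) is an assumption about which characters the path may contain.

```php
<?php
// Sample input taken from the question.
$text = 'Some text https://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr';

// "~" is the delimiter, so "/" inside the pattern needs no escaping.
// The dots are escaped so they match a literal "." only.
// Assumption: the path consists of letters, digits, "/" and "-".
$pattern = '~https?://www\.myweb\.com/([0-9a-zA-Z/-]+)~';

// preg_match stops at the first match, so later URLs are ignored.
if (preg_match($pattern, $text, $m)) {
    $newvar = $m[1]; // the part of the URL after the domain
    echo $newvar, PHP_EOL;
}
```

With the sample string, `$newvar` ends up holding `page/cat/323123442321-rghe432`.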

Wiktor Stribiżew · Accepted Answer · 2015-12-02T13:21:06.510

You can use preg_replace_callback and pass an array by reference into the anonymous function (or into your own callback function) to fill it with all the necessary URL parts.

Here is a demo:

$rests = array();
$re = '~\b(https?://)www\.myweb\.com/(\S+)~'; 
$str = "Some text https://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr"; 
echo $result = preg_replace_callback($re, function ($m) use (&$rests) {
    array_push($rests, $m[2]);
    return $m[1] . "embed.myweb.com/" . $m[2];
}, $str) . PHP_EOL;
print_r($rests);

Results:

Some text https://embed.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr
Array
(
    [0] => page/cat/323123442321-rghe432
)

A couple of words:

'~\b(https?://)www\.myweb\.com/(\S+)~' has ~ as a regex delimiter, so you do not have to escape /.
It is declared with a single-quoted literal, so you do not have to double the backslash (\\S) for \S.
It matches and captures 2 substrings into capturing groups: \b(https?://) (which matches a whole word http or https followed by ://) and (\S+) (which matches 1 or more non-whitespace characters). Capturing groups are marked with (...) in the pattern and can be accessed via $m[n] inside the callback, where n is the number of the capturing group.
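
To make the group numbering concrete, here is a minimal sketch that runs the same pattern through `preg_match` and prints each entry of the match array. The shortened sample string is an assumption made for illustration.

```php
<?php
$re  = '~\b(https?://)www\.myweb\.com/(\S+)~';
$str = 'Some text https://www.myweb.com/page/cat/323123442321-rghe432 and more';

if (preg_match($re, $str, $m)) {
    // $m[0] is the whole match; $m[1] and $m[2] are the capturing groups.
    echo $m[0], PHP_EOL; // https://www.myweb.com/page/cat/323123442321-rghe432
    echo $m[1], PHP_EOL; // https://
    echo $m[2], PHP_EOL; // page/cat/323123442321-rghe432
}
```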

UPDATE

If you only need to replace the first occurrence of the URL, pass the limit argument to the preg_replace_callback:

$rest = "";
$re = '~\b(https?://)www\.myweb\.com/(\S+\b)~'; 
$str = "Some text https://www.myweb.com/page/cat/323123442321-rghe432, another http://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr"; 
echo $result = preg_replace_callback($re, function ($m) use (&$rest) {
    $rest = $m[2];
    return $m[1] . "embed.myweb.com/" . $m[2];
}, $str, 1) . PHP_EOL;
// The fourth argument (1) is the limit: only the first match is replaced.
echo $rest;

See another IDEONE demo

edited Dec 02 '15 at 13:21

answered Dec 02 '15 at 13:08

Wiktor Stribiżew

This works perfectly, but I don't need an array in this instance. I only need to match and collect the url for the first instance of the url. Is this routine possible without using an array? – Grant Dec 02 '15 at 13:13
Just note that `use (&$rests)` requires the use of `&$` since we need to write to the array inside the callback function. – Wiktor Stribiżew Dec 02 '15 at 13:14
Do you mean to say that you have `Some text https://www.myweb.com/page1 and https://www.myweb.com/page2` and you want only the first one replaced? Use `1` as the last argument to `preg_replace_callback`. – Wiktor Stribiżew Dec 02 '15 at 13:17
Thank you @stribizhev that is perfect! – Grant Dec 02 '15 at 13:54
One last question if you don't mind. What if I wanted to only collect PART of the url? So from https://www.myweb.com/page/cat/323123442321-rghe432 I only want to store 'page/cat' (without the trailing slash after cat) in the $rest var. Just trying to understand how to use regex a bit better. – Grant Dec 02 '15 at 14:14
Use negated character class `[^/]*` to match 0 or more characters other than `/` to stay inside the URL parts. See [demo](https://regex101.com/r/hU5yS9/1). – Wiktor Stribiżew Dec 02 '15 at 14:16
Oh man I wish I knew more about regex.. I just can't get it working.. Demo - https://ideone.com/sPRcOW – Grant Dec 02 '15 at 14:43
Why use `preg_replace`? You say you need to get the value. Use `preg_match`: `$re = '~https?://www\.myweb\.com/([^/]*/[^/]*)~'; $str = "Some text https://www.myweb.com/page/cat/323123442321-rghe432, another http://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr"; preg_match($re, $str, $m); echo $m[1];` ([demo](http://ideone.com/kWXSKl)) – Wiktor Stribiżew Dec 02 '15 at 14:49
But I need to remove the original text url completely and replace it with a new url that I will generate using the matched elements – Grant Dec 02 '15 at 15:00
Ok, use [this one](https://ideone.com/PyQK0q). `$re = '~(https?://)www\.myweb\.com/(([^/]*/[^/]*)\S+\b)~';` and then `$result = preg_replace_callback($re, function ($m) use (&$rest) { $rest = $m[3]; return $m[1] . "embed.myweb.com/" . $m[2]; }, $str, 1)` – Wiktor Stribiżew Dec 02 '15 at 15:06
That is exactly what I was after. Thank you so much, I've learned a lot more about regex in this thread than I've been able to do reading anything else. – Grant Dec 02 '15 at 19:49
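
For readability, the final pattern from the comments above can be laid out as a short script. This is a sketch assembled from the snippets in the thread rather than the linked demo itself. The sample string is shortened, and it assumes the path always has at least two segments before the part that should be dropped.

```php
<?php
$rest = '';

// Group 1: the scheme. Group 2: the full path. Group 3: the first two
// path segments, matched with the negated class [^/]* (anything but "/").
$re  = '~(https?://)www\.myweb\.com/(([^/]*/[^/]*)\S+\b)~';
$str = 'Some text https://www.myweb.com/page/cat/323123442321-rghe432, another http://www.myweb.com/page/cat/323123442321-rghe432';

$result = preg_replace_callback($re, function ($m) use (&$rest) {
    $rest = $m[3];                             // e.g. "page/cat"
    return $m[1] . 'embed.myweb.com/' . $m[2]; // rebuilt embed URL
}, $str, 1); // limit of 1: only the first matching URL is replaced

echo $rest, PHP_EOL;
echo $result, PHP_EOL;
```

Here `$rest` receives `page/cat`, and only the first www.myweb.com URL in `$result` is rewritten to the embed.myweb.com form; the second one is left untouched because of the limit.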
