You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (`(` `)`) in the regex by non-capturing groups (`(?:` `)`). A few of the groups are redundant, and `http|https` can be simplified to `https?`. Together this gives us:
(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
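To illustrate why the group type can matter even when only the whole match is wanted, here is a small Python sketch. The sample text and the shortened host pattern `[\w.-]+` are made up for the demonstration and are not part of your regex. In Python, `re.findall` returns the group contents as soon as the pattern contains a capturing group, so the non-capturing form is the one that yields whole matches.

```python
import re

# Hypothetical sample text, only for illustration.
text = "see http://a.example/x and ftp://b.example/y"

# With a capturing group, re.findall returns only what the group matched.
capturing = re.findall(r"(http|https|ftp)://[\w.-]+", text)

# With a non-capturing group (and http|https folded into https?),
# re.findall returns the whole match.
non_capturing = re.findall(r"(?:https?|ftp)://[\w.-]+", text)

print(capturing)      # ['http', 'ftp']
print(non_capturing)  # ['http://a.example', 'ftp://b.example']
```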
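Strictly speaking, `_` is not allowed in hostnames, so we drop it from the hostname character classes (note that `\w` itself still matches `_`, so this only removes the redundant explicit entry):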
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Technically `-` cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
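The limitations mentioned here can be checked against the hostname part in isolation. The following Python sketch is only a test harness; `is_host` is a hypothetical helper name and the sample hosts are invented. It also shows that `\w` still accepts an underscore.

```python
import re

# Hostname part of the regex above, taken on its own.
HOST = re.compile(r"[\w-]+(?:\.[\w-]+)+")

def is_host(candidate: str) -> bool:
    """Return True if the whole string matches the hostname part."""
    return HOST.fullmatch(candidate) is not None

print(is_host("www.example.com"))      # True
print(is_host("-foo-.example.com"))    # True: leading/trailing '-' is not rejected
print(is_host("my_host.example.com"))  # True: \w matches '_'
print(is_host("example.com:8080"))     # False: no port syntax in the hostname part
print(is_host("[::1]"))                # False: no IPv6 literals
```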
The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both `\w` and `-` are in both character classes). We can fix this by requiring a separator of either `/` or `?` after the hostname:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
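A quick way to see the effect of the required separator is to run both versions on the same input. This Python sketch uses the names `LOOSE` and `STRICT` (my own labels) for the regex without and with the `[/?]` separator; the sample strings are invented.

```python
import re

# Version without a required separator after the hostname.
LOOSE = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)
# Version that requires '/' or '?' before the path/query/anchor part.
STRICT = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+"
    r"(?:[/?][\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

odd = "http://example.com,foo"
print(LOOSE.match(odd).group(0))   # http://example.com,foo
print(STRICT.match(odd).group(0))  # http://example.com

# An ordinary URL is matched in full by both versions.
url = "http://example.com/a/b?x=1#top"
print(LOOSE.match(url).group(0) == url, STRICT.match(url).group(0) == url)
```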
Now we can start looking at your additional requirement: The URL should contain `/video/hd/`. Presumably this string should appear somewhere in the path. We can encode this as follows:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,@^=%&:/~+-]*/)?video/hd/(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Instead of matching an optional separator of `/` or `?`, we now always require a `/` after the hostname. This `/` must be followed by either `video/hd/` directly or 0 or more path characters and another `/`, which is then followed by `video/hd/`. (The set of path characters does not include `?` (which would start the query string) or `#` (which would start the anchor).)
As before, after `/video/hd/` there can be a final part of more path components, a query string, and an anchor (all optional).
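As a sanity check, the final regex can be tried against a few URLs. This is a minimal Python sketch; `VIDEO_HD`, `matches`, and all sample URLs are hypothetical names and data chosen for the test, and `re.fullmatch` is used on the assumption that each string is a single candidate URL.

```python
import re

# Final regex from above, split across lines for readability.
VIDEO_HD = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+"
    r"/(?:[\w.,@^=%&:/~+-]*/)?video/hd/"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

def matches(url: str) -> bool:
    """Return True if the entire string is a URL with /video/hd/ in its path."""
    return VIDEO_HD.fullmatch(url) is not None

print(matches("http://example.com/video/hd/"))                         # True
print(matches("https://cdn.example.com/a/b/video/hd/clip.mp4?x=1#t"))  # True
print(matches("http://example.com/video/sd/clip.mp4"))                 # False
print(matches("http://example.com/?next=/video/hd/"))                  # False: only in the query string
print(matches("http://example.com/myvideo/hd/x"))                      # False: 'video' must start a path component
```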