-2

I am using the following regex:

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

It matches every URL, but I want to match only URLs that contain

/video/hd/

The following correction of the regex above did not handle slashes correctly:

((?:https\:\/\/)|(?:http\:\/\/)|(?:www\.))?([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(?:\??)[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]+)
Francesco B.
John
  • What language are you using? You didn't escape all the appearances of `/` – GalAbra Mar 11 '18 at 19:03
  • The regex is used within a Chrome plugin. The documentation just says: "Regex: The regular expression attribute can be used to extract a substring of the text that the selector extracts. When a regular expression is used the whole match (group 0) will be returned as a result." – John Mar 11 '18 at 19:07
  • Which chrome plugin? – melpomene Mar 11 '18 at 19:08
  • http://webscraper.io – John Mar 11 '18 at 19:39

2 Answers

3

You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (( )) in the regex by non-capturing groups ((?: )). A few of the groups are redundant, and http|https can be simplified to https?. Together this gives us

(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
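A quick way to check that this simplification leaves the whole match (group 0) unchanged is to run both patterns over a few inputs. The sketch below uses Python's re module only for illustration (the plugin itself presumably uses the browser's regex engine); the sample strings and the helper name whole_match are made up for this check.

```python
import re

# Original pattern from the question (three capturing groups).
ORIGINAL = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
# Simplified pattern: non-capturing groups only, https? instead of http|https.
SIMPLIFIED = r"(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

# Hypothetical sample inputs, not taken from the question.
SAMPLES = [
    "see https://example.com/video/hd/clip.mp4 for details",
    "ftp://files.example.org/pub/readme.txt",
    "plain http://example.com here",
]

def whole_match(pattern, text):
    """Return group 0 of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Both patterns should yield the same whole match on every sample.
for text in SAMPLES:
    assert whole_match(ORIGINAL, text) == whole_match(SIMPLIFIED, text)
```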

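_ is not allowed in hostnames. The explicit _ in the character class is also redundant, because \w already matches it, so we can drop it (strictly excluding _ would require [A-Za-z0-9-] instead of [\w-]):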

(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

Technically - cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
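One detail worth checking when tightening the hostname part is how \w treats the underscore. The small sketch below (Python, for illustration only) shows that a class built on \w still accepts _; the stricter class STRICT_LABEL is an assumption of this example and not part of the regex above.

```python
import re

# \w covers letters, digits, and the underscore, so [\w-] still accepts "_".
assert re.fullmatch(r"[\w-]+", "my_host") is not None

# A stricter label class (hypothetical, for illustration) that rejects "_".
STRICT_LABEL = r"[A-Za-z0-9-]+"
assert re.fullmatch(STRICT_LABEL, "my_host") is None
assert re.fullmatch(STRICT_LABEL, "my-host") is not None
```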

The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both \w and - are in both character classes). We can fix this by requiring a separator of either / or ? after the hostname:

(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
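To see what the required separator changes in practice, the following sketch compares the two versions on a made-up input where a comma directly follows the hostname. It is only an illustration in Python; the variable names are hypothetical.

```python
import re

WITHOUT_SEPARATOR = r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
WITH_SEPARATOR = r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

text = "http://example.com,foo"  # hypothetical input

# Without the separator, the tail class swallows ",foo" right after the host.
loose = re.search(WITHOUT_SEPARATOR, text).group(0)
# With the separator, the tail may only start with "/" or "?".
strict = re.search(WITH_SEPARATOR, text).group(0)

print(loose)   # http://example.com,foo
print(strict)  # http://example.com

# An ordinary URL with a path and query string is matched in full by both.
normal = "http://example.com/a?b=1"
assert re.search(WITH_SEPARATOR, normal).group(0) == normal
```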

Now we can start looking at your additional requirement: The URL should contain /video/hd/. Presumably this string should appear somewhere in the path. We can encode this as follows:

(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,@^=%&:/~+-]*/)?video/hd/(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

Instead of matching an optional separator of / or ?, we now always require a / after the hostname. This / must be followed by either video/hd/ directly or 0 or more path characters and another /, which is then followed by video/hd/. (The set of path characters does not include ? (which would start the query string) or # (which would start the anchor).)

As before, after /video/hd/ there can be a final part of more path components, a query string, and an anchor (all optional).
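As a sanity check, here is a minimal sketch that runs the final pattern against a few made-up URLs. It uses Python's re module purely for illustration; the helper extract and the example.com inputs are assumptions of this sketch, and the plugin's own engine may differ in details.

```python
import re

VIDEO_HD_URL = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/"      # scheme, hostname, first slash
    r"(?:[\w.,@^=%&:/~+-]*/)?video/hd/"           # optional leading path, then video/hd/
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"    # optional rest: path, query, anchor
)

def extract(text):
    """Return the whole match (group 0) of the first hit, or None."""
    m = VIDEO_HD_URL.search(text)
    return m.group(0) if m else None

# /video/hd/ directly after the hostname.
assert extract("https://example.com/video/hd/clip.mp4") == "https://example.com/video/hd/clip.mp4"
# /video/hd/ deeper in the path, followed by a query string and an anchor.
assert extract("go to https://example.com/media/video/hd/clip?x=1#top now") == \
    "https://example.com/media/video/hd/clip?x=1#top"
# A different path does not match.
assert extract("https://example.com/video/sd/clip.mp4") is None
# /video/hd/ inside the query string does not count as part of the path.
assert extract("https://example.com/?next=/video/hd/") is None
```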

melpomene
1

First of all, you need a regex to match URLs (be they http, https...)

(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))

Once you have that, you need to match URLs without "consuming" them. You can do this with a lookahead, i.e. a construct that asserts that what follows the current position is, e.g., foo:

(?=foo)

Of course you will replace foo with the first regex I wrote.
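To illustrate what "not consuming" means, here is a small sketch using the foo placeholder. It is written in Python only for demonstration; lookahead syntax is the same in most common regex flavors.

```python
import re

text = "foobar"

# The lookahead alone succeeds at position 0 but matches the empty string.
probe = re.match(r"(?=foo)", text)
print(repr(probe.group(0)), probe.end())  # '' 0

# Whatever follows the lookahead starts matching from the same position.
full = re.match(r"(?=foo)\w+", text)
print(full.group(0))  # foobar

# If the asserted text is absent, the whole pattern fails.
print(re.match(r"(?=foo)\w+", "barfoo"))  # None
```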

At this point, you know you have found a URL; now you just constrain your search to URLs that contain /video/hd/:

.*\/video\/hd\/.*

So the complete regex is

(?=(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))).*\/video\/hd\/.*
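The structure of this combined pattern (a lookahead for a URL, followed by a filter on /video/hd/) can be sketched as follows. Note the assumptions: the sketch is in Python, and it replaces the long URL pattern with a deliberately simplified stand-in, https?://\S+, so it illustrates the structure rather than the exact regex above.

```python
import re

# Simplified stand-in for the URL pattern (an assumption for illustration).
SIMPLE_URL = r"https?://\S+"
# Same shape as the combined regex: lookahead for a URL, then the filter.
FILTERED = re.compile(r"(?=" + SIMPLE_URL + r").*/video/hd/.*")

def extract(text):
    """Return the whole match (group 0) of the first hit, or None."""
    m = FILTERED.search(text)
    return m.group(0) if m else None

print(extract("https://example.com/video/hd/a.mp4"))
# https://example.com/video/hd/a.mp4
print(extract("https://example.com/video/sd/a.mp4"))
# None
print(extract("see https://example.com/video/hd/a.mp4; more text"))
# https://example.com/video/hd/a.mp4; more text
```

Because .* runs to the end of the line, the whole match can include text that follows the URL, as the last call shows. If only the URL itself is wanted, the part after /video/hd/ would need a more restrictive character class than the dot.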

You can test it here with a live demo.

Francesco B.
  • Thank you for all the information and guidance. One problem I found is that the regex also matches a semicolon as part of the URL (https://regex101.com/r/nmW218/1). – John Mar 12 '18 at 23:53
  • Yes, semicolons are allowed in URLs, and they actually have a specific [meaning](https://stackoverflow.com/questions/1178024/can-a-url-contain-a-semi-colon) – Francesco B. Mar 13 '18 at 06:11