Why are you limiting the TLD to 4 characters? Many valid TLDs are longer than that, such as .finance, .movie, .academy, etc.
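To make the point concrete, here is a minimal Python sketch. The `capped` pattern is a hypothetical stand-in for a host pattern whose TLD is limited to 2-4 letters (it is not the pattern from the question); `uncapped` reuses the label pattern from the `domain` group in the regex in this answer, which puts no length cap on the last label.

```python
import re

# Hypothetical stand-in: last label (TLD) capped at 2-4 letters.
capped = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,4}")

# Label pattern taken from the `domain` group: no cap on the last label.
uncapped = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+")

hosts = ["example.com", "example.finance", "example.movie", "example.academy"]
for host in hosts:
    # Print whether each pattern accepts the whole host name.
    print(host, bool(capped.fullmatch(host)), bool(uncapped.fullmatch(host)))
```

Only `example.com` passes the capped pattern, while all four hosts pass the uncapped one.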
You can use my answer from a previous post and make some minor adjustments.
(?(DEFINE)
(?<scheme>[a-z][a-z0-9+.-]*)
(?<userpass>([^:@\/\s]+(:[^:@\/\s]+)?@))
(?<domain>[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+)
(?<ip>(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])))
(?<host>((?&domain)|(?&ip)))
(?<port>(:[\d]{1,5}))
(?<path>([^?;\#\s]*))
(?<query>(\?[^\#;\s]*))
(?<anchor>(\#\S*))
)
(?:^)?-\ +((?:(?&scheme):\/\/)?(?&userpass)?(?&host)(?&port)?\/?(?&path)?(?&query)?(?&anchor)?)(?:$|\s+)
You can see this regex in use here. It should catch all valid URLs. Since the scheme is optional in your case, I've made it optional in the regex as well.
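If your engine does not support `(?(DEFINE)...)` and `(?&name)` (Python's built-in `re` module does not), the same structure can be approximated by composing the sub-patterns as strings. The following is only a sketch under a few assumptions: the IPv6 alternative is left out for brevity, the user and password parts are allowed to be more than one character long (and may not contain whitespace), and the pattern is written without extended mode, so the space and `#` are not escaped. The variable names and the sample text are made up for illustration.

```python
import re

# Sub-patterns mirroring the named groups in the regex above.
scheme = r"[a-z][a-z0-9+.-]*"
userpass = r"[^:@/\s]+(?::[^:@/\s]+)?@"   # assumption: multi-character user/password
domain = r"[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+(?:-[a-z0-9]+)*)+"
port = r":\d{1,5}"
path = r"[^?;#\s]*"
query = r"\?[^#;\s]*"
anchor = r"#\S*"

# Same overall shape as the final line of the regex; IPv6 hosts are omitted here.
url_line = re.compile(
    rf"(?:^)?- +((?:{scheme}://)?(?:{userpass})?{domain}"
    rf"(?:{port})?/?{path}(?:{query})?(?:{anchor})?)(?:$|\s+)",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical input: a dash-prefixed list with one entry per line.
text = (
    "- example.finance\n"
    "- https://user:pw@sub.example.academy:8080/a/b?x=1#top\n"
    "- not a url\n"
)

# findall returns the single capture group, i.e. the URL without the "- " prefix.
urls = url_line.findall(text)
print(urls)
```

Every inner group is non-capturing, so `findall` returns just the URL text for each matching line.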