Hossein, your question contains several separate points, so let's take them in turn.
A. How to include or exclude some specific patterns in regex?
There are many techniques. For simple patterns, you specify what you want, or you specify what you don't want, either with negated character classes or negative lookarounds. For more intricate patterns, a great place to start is "Match (or replace) a pattern except in situations s1, s2, s3 etc".
B. How can a specific word be included or excluded?
In general, to make sure a specific word appears (or doesn't appear) in a string when you don't know where it would be placed, use a lookahead (or negative lookahead) at the beginning of the string:
^(?=.*?MyWord) # makes sure the word is there
or
^(?!.*?MyWord) # makes sure the word is not there
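As a rough illustration of the two lookaheads above, here is a minimal Python sketch. The sample strings and variable names are made up for the example, and MyWord is just a placeholder for whatever word you care about. Note that by default the dot does not match newlines, so this assumes single-line input.

```python
import re

# "MyWord" is a placeholder; the sample strings are invented for illustration.
must_contain = re.compile(r"^(?=.*?MyWord)")      # succeeds only if the word is present
must_not_contain = re.compile(r"^(?!.*?MyWord)")  # succeeds only if the word is absent

samples = ["this has MyWord in it", "this one does not"]

# For each sample, record whether each lookahead succeeds.
results = [
    (s, bool(must_contain.search(s)), bool(must_not_contain.search(s)))
    for s in samples
]

for text, has_word, lacks_word in results:
    print(f"{text!r}: contains={has_word}, lacks={lacks_word}")
```

Both patterns are zero-width, so they only validate the string; you would normally follow the lookahead with whatever you actually want to match.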
C. "What is clear now, is that (http) is not treated like a word, it is just a class set of characters, so any word that has only one of those letters gets a match."
That is not correct. (http) will only match the literal sequence http. It will not match ptth, for instance. Perhaps you are thinking of [http], which is a character class that matches a single character that is h, t or p (the repeated t is redundant, so [pth] would do).
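If it helps, the difference between the group and the character class can be checked quickly in Python. This is only a small sketch; the test strings are arbitrary.

```python
import re

group = re.compile(r"(http)")       # capturing group: the literal sequence h-t-t-p
char_class = re.compile(r"[http]")  # character class: one character, any of h, t, p

# The group needs the four letters in order, so "ptth" does not match.
group_on_ptth = group.search("ptth")
group_on_url = group.search("http://example.com")

# The character class matches each of the letters individually.
class_on_ptth = char_class.findall("ptth")

# [pth] describes exactly the same set of characters.
short_class_on_ptth = re.findall(r"[pth]", "ptth")

print(group_on_ptth)
print(group_on_url.group(1))
print(class_on_ptth)
print(short_class_on_ptth)
```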
D. How to Match the Parts of a URL
There are many solutions to this, but I'd suggest not reinventing the wheel. May I suggest the regex in the RegexBuddy library for this purpose? It is:
(?i)\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
Here follows a token-by-token explanation (I added the case-insensitive (?i) modifier at the beginning):
- Assert position at a word boundary (position preceded or followed—but not both—by a Unicode letter, digit, or underscore)
\b
- Match the regex below and capture its match into backreference number 1
((?#protocol)https?|ftp)
- Match this alternative (attempting the next alternative only if this one fails)
(?#protocol)https?
- Comment: protocol
(?#protocol)
- Match the character string “http” literally (case insensitive)
http
- Match the character “s” literally (case insensitive)
s?
- Between zero and one times, as many times as possible, giving back as needed (greedy)
?
- Or match this alternative (the entire group fails if this one fails to match)
ftp
- Match the character string “ftp” literally (case insensitive)
ftp
- Match the character string “://” literally
://
- Match the regex below and capture its match into backreference number 2
((?#domain)[-A-Z0-9.]+)
- Comment: domain
(?#domain)
- Match a single character present in the list below
[-A-Z0-9.]+
- Between one and unlimited times, as many times as possible, giving back as needed (greedy)
+
- The literal character “-”
-
- A character in the range between “A” and “Z” (case insensitive)
A-Z
- A character in the range between “0” and “9”
0-9
- The literal character “.”
.
- Match the regex below and capture its match into backreference number 3
((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?
- Between zero and one times, as many times as possible, giving back as needed (greedy)
?
- Comment: file
(?#file)
- Match the character “/” literally
/
- Match a single character present in the list below
[-A-Z0-9+&@#/%=~_|!:,.;]*
- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
*
- The literal character “-”
-
- A character in the range between “A” and “Z” (case insensitive)
A-Z
- A character in the range between “0” and “9”
0-9
- A single character from the list “+&@#/%=~_|!:,.;”
+&@#/%=~_|!:,.;
- Match the regex below and capture its match into backreference number 4
((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
- Between zero and one times, as many times as possible, giving back as needed (greedy)
?
- Comment: parameters
(?#parameters)
- Match the character “?” literally
\?
- Match a single character present in the list below
[A-Z0-9+&@#/%=~_|!:,.;]*
- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
*
- A character in the range between “A” and “Z” (case insensitive)
A-Z
- A character in the range between “0” and “9”
0-9
- A single character from the list “+&@#/%=~_|!:,.;”
+&@#/%=~_|!:,.;
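To see the four capture groups described above in action, here is a minimal Python sketch. The pattern is the one quoted earlier, split across string literals only for readability; the helper name split_url and the example URLs are my own inventions for the illustration, and Python's re module is simply one engine that happens to accept the (?i) modifier and the (?#...) comments as written.

```python
import re

# The regex quoted above, unchanged; adjacent string literals are concatenated.
URL_PATTERN = re.compile(
    r"(?i)\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)"
    r"((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"
    r"((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?"
)


def split_url(text):
    """Hypothetical helper: return (protocol, domain, file, parameters) or None."""
    match = URL_PATTERN.search(text)
    return match.groups() if match else None


# Example inputs invented for the illustration.
full = split_url("See https://www.example.com/path/page.html?id=42&lang=en for details")
bare = split_url("ftp://files.example.org")
none = split_url("no link here")

print(full)
print(bare)
print(none)
```

The optional file and parameters groups come back as None when they do not take part in the match, so check for that before using them.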