I want a regular expression that matches URLs which do not contain a specific word in their domain name. It doesn't matter whether that word appears in the query string or in subdirectories of the domain. It also doesn't matter how the URL starts, for example with http, ftp, https, or none of them. I found this expression: ^((?!foo).)*$ but I don't know how to change it to fit these conditions. These are the accepted URLs for the word "foo":

whatever.whatever.whatever/foo/pic
whatever.whatever.whatever?sdfd="foo"

and these are not accepted:

whatever.whateverfoo.whatever
whatever.foowhatever.whatever
whatever.foo.whatever.whatever
whatever.whatever.foo.whatever
barfuin
Marjan

3 Answers

Try this (explanation):

^(?:(?!foo).)*?[\/\?]

What this means is basically:

  1. match anything not containing foo
  2. until a slash or question mark is encountered

The precise syntax may vary depending on your programming language/editor. The explanation link shows the PHP example. The regex elements I've used are pretty common, so it should work for you. If not, let me know.

This regex can only be matched against a single URL at a time. So if you are trying this in regex101, don't enter all URLs at once.
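To illustrate the one-URL-at-a-time point, here is a minimal Java sketch that applies the pattern above to each example URL separately. The class name TemperedDotDemo and the helper accepts are made up for this illustration, and find() is an assumption about how the pattern would be applied, since it only covers the text up to the first slash or question mark.

```java
import java.util.regex.Pattern;

public class TemperedDotDemo {
    // The pattern from the answer, with "foo" as the forbidden word.
    static final Pattern NO_FOO_BEFORE_PATH =
            Pattern.compile("^(?:(?!foo).)*?[/?]");

    // Hypothetical helper: true if no "foo" occurs before the first '/' or '?'.
    // find() is used because the pattern does not consume the rest of the URL.
    static boolean accepts(String url) {
        return NO_FOO_BEFORE_PATH.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "whatever.whatever.whatever/foo/pic",
            "whatever.whatever.whatever?sdfd=\"foo\"",
            "whatever.whateverfoo.whatever",
            "whatever.foo.whatever.whatever/pic",
        };
        // Each URL gets its own matcher; the pattern is anchored with ^.
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Note that a bare host with no path or query, such as whatever.whatever.whatever, is not matched either, because the pattern requires a slash or question mark to be present.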


Update: Example in Java (now using turner instead of foo):

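import java.util.regex.Pattern;

// "turner" replaces "foo"; the trailing .* lets matches() cover the whole URL
Pattern p = Pattern.compile("^(?:(?!turner).)*?[\\/\\?].*");
// "turner" is in the host, so this prints false
System.out.println(p.matcher(
    "i.cdn.turner.com/cnn/.e/img/3.0/1px.gif").matches());
// "turner" appears only in the query string, so this prints true
System.out.println(p.matcher(
    "www.facebook.com/plugins/like.php?href=http%3A%2F%2F"
    + "www.facebook.com%2Fturnerkjljl").matches());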

Output:

false
true
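When a URL starts with a scheme such as http://, the first slash the pattern meets belongs to the scheme rather than to the path. One possible way around that, sketched here under assumptions, is to isolate the host first and only then look for the word. The class HostWordCheck and the helpers hostPart and accepts are hypothetical names, the scheme pattern is a simplification, and the comparison is case-sensitive.

```java
public class HostWordCheck {
    // Hypothetical helper: returns the text between an optional "scheme://"
    // prefix and the first '/', '?' or '#'.
    static String hostPart(String url) {
        // Drop a leading scheme such as http://, https:// or ftp:// if present.
        String rest = url.replaceFirst("^[A-Za-z][A-Za-z0-9+.-]*://", "");
        // Keep only what precedes the path, query or fragment.
        return rest.split("[/?#]", 2)[0];
    }

    // Hypothetical helper: true if the word does not occur in the host part.
    static boolean accepts(String url, String word) {
        return !hostPart(url).contains(word);
    }

    public static void main(String[] args) {
        System.out.println(accepts(
            "http://i.cdn.turner.com/cnn/.e/img/3.0/1px.gif", "turner"));
        System.out.println(accepts(
            "www.facebook.com/plugins/like.php?href=http%3A%2F%2F"
            + "www.facebook.com%2Fturnerkjljl", "turner"));
    }
}
```

With the two URLs from the example above, the first call should report false and the second true, with or without the http:// prefix.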
barfuin
  • thanks, but I'm not looking for string that has query string of foo. I'm looking for urls that do not contain foo in their main domain. – Marjan Sep 26 '13 at 18:15
  • I tested this ^((?!turner).)*?[\/\?] but it matches http://i.cdn.turner.com/cnn/.e/img/3.0/1px.gif, which I don't want. – Marjan Sep 26 '13 at 18:58
  • @Marjan In my regex101 link, your test input works! What environment are you using? Java? Unix shell script ...? – barfuin Sep 26 '13 at 19:15
  • Tried it in [pythonregex.com](http://www.pythonregex.com/) and it worked, too. Maybe you can post what your input into pythonregex.com is? – barfuin Sep 26 '13 at 20:33

Here's a regex that will match the cases that you want to reject:

(?:.+://){0,1}(?<subdomain>[^.]+\.){0,1}(?<domain>[^.]*whatever[^.]*\.)(?<top>[^.]+).*

(?: ) is a non-capturing group

(?<groupName> ) is a named group (useful for testing, in regexhero you can see what is being captured by the group)

{0,1} means 0 or 1

. means any character except new line

[^.] means any character except "."

* means 0 or more

+ means 1 or more, for example, .+ means 1 or many "any characters"

\. escapes the special character .

See http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

You can try it here: http://regexhero.net/tester/
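For readers working in Java rather than .NET, the same pattern can be tried as follows. This is only a sketch: "foo" is substituted for "whatever" in the domain group, matches() is assumed as the way to apply it, and the class NamedGroupDemo with its helpers rejected and domainGroup is invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // The answer's pattern with "foo" as the word to look for in the domain.
    static final Pattern REJECT = Pattern.compile(
        "(?:.+://){0,1}(?<subdomain>[^.]+\\.){0,1}"
        + "(?<domain>[^.]*foo[^.]*\\.)(?<top>[^.]+).*");

    // Hypothetical helper: true if the whole URL matches the reject pattern.
    static boolean rejected(String url) {
        return REJECT.matcher(url).matches();
    }

    // Hypothetical helper: text captured by the "domain" group, or null.
    static String domainGroup(String url) {
        Matcher m = REJECT.matcher(url);
        return m.matches() ? m.group("domain") : null;
    }

    public static void main(String[] args) {
        System.out.println(rejected("whatever.foowhatever.whatever"));
        System.out.println(domainGroup("whatever.foowhatever.whatever"));
        System.out.println(rejected("whatever.whatever.whatever/foo/pic"));
        System.out.println(rejected("whatever.whatever.foo.whatever"));
    }
}
```

Because the subdomain group is limited to {0,1}, the pattern applied this way allows at most one label before the domain, so whatever.whatever.foo.whatever from the question is not matched.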

Rui
  • thanks, by "whatever" I mean it can be anything. I just want to look at the domain part: xxx.yyy.xxx so the yyy part shouldn't contain foo – Marjan Sep 26 '13 at 18:19
  • I've updated the answer with a regex that matches the cases you want to reject. In the example, the regex will match any URL that contains "whatever" in the "yyy" (domain) part – Rui Sep 27 '13 at 09:40

Here is your regex in Java:

"^[^/?]+(?<!foo)"

Explanation: starting from the beginning, the pattern matches characters that are neither / nor ?. As soon as it reaches one of those two characters, it looks backward with a negative lookbehind for foo. If foo is found it returns false, otherwise true. This is for Java; the regex will vary from language to language.

With the grep command (Unix or a shell script) you have to negate the match of the following regex:

"^[^/?]+foo"
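To check how the lookbehind pattern behaves in practice, here is a small Java sketch using "turner" as the word. It also includes, for comparison, a negative lookahead variant that is not part of this answer. The class LookaroundDemo and both helper names are hypothetical, find() is an assumption about how the patterns are applied, and URLs are assumed to have no scheme prefix.

```java
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Pattern from the answer, with "turner" as the word.
    static final Pattern LOOKBEHIND = Pattern.compile("^[^/?]+(?<!turner)");

    // Comparison variant (an assumption, not from the answer): fail at the
    // start if "turner" occurs anywhere before the first '/' or '?'.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?![^/?]*turner)");

    static boolean lookbehindFinds(String url) {
        return LOOKBEHIND.matcher(url).find();
    }

    static boolean lookaheadFinds(String url) {
        return LOOKAHEAD.matcher(url).find();
    }

    public static void main(String[] args) {
        String inHost = "i.cdn.turner.com/cnn/.e/img/3.0/1px.gif";
        String inQuery = "www.facebook.com/plugins/like.php?href=http%3A%2F%2F"
                + "www.facebook.com%2Fturnerkjljl";
        System.out.println(lookbehindFinds(inHost) + " " + lookbehindFinds(inQuery));
        System.out.println(lookaheadFinds(inHost) + " " + lookaheadFinds(inQuery));
    }
}
```

The lookbehind only inspects the characters immediately before the position where [^/?]+ stops, and the quantifier can give characters back, so with find() it also succeeds on the URL whose host contains turner. The lookahead variant scans the whole stretch before the first / or ? instead.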
Braj Kishore
  • It does the opposite of what I want. I used this regex "^[^/?]+(?<!turner)" but got this output: http://i.cdn.turner.com/cnn/.e/img/3.0/1px.gif, while I want it to match www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.facebook.com%2Fturnerkjljl – Marjan Sep 26 '13 at 18:55
  • Good. The first regex should do exactly that, while the second one does the opposite, as I mentioned above. – Braj Kishore Sep 27 '13 at 02:41