9

I have a collection of strings, and all I want from the regex is to collect every value that starts with http:

href="http://www.test.com/cat/1-one_piece_episodes/"href="http://www.test.com/cat/2-movies_english_subbed/"href="http://www.test.com/cat/3-english_dubbed/"href="http://www.exclude.com"

This is my regular expression pattern:

href="(.*?)[^#]"

and it returns this:

href="http://www.test.com/cat/1-one_piece_episodes/"
href="http://www.test.com/cat/2-movies_english_subbed/"
href="http://www.xxxx.com/cat/3-english_dubbed/"
href="http://www.exclude.com"

What is the pattern for excluding the last match, or more generally for excluding any match that contains the exclude domain, like href="http://www.exclude.com"?

EDIT: for multiple exclusions:

href="((?:(?!"|\bexclude\b|\bxxxx\b).)*)[^#]"
Vincent Dagpin
  • Would you want the URL `http://www.test.com/fish/exclude` included? What about `http://www.exclude.co.uk` or `http://www.exclude.test.com`? – Bob Vale Aug 05 '11 at 12:24

3 Answers

17

@ridgerunner and I would change the regex to:

href="((?:(?!\bexclude\b)[^"])*)[^#]"

It matches all href attributes as long as they don't end in # and don't contain the word exclude.

Explanation:

href="     # Match href="
(          # Capture...
 (?:       # the following group:
  (?!      # Look ahead to check that the next part of the string isn't...
   \b      # the entire word
   exclude # exclude
   \b      # (\b are word boundary anchors)
  )        # End of lookahead
  [^"]     # If successful, match any character except for a quote
 )*        # Repeat as often as possible
)          # End of capturing group 1
[^#]"      # Match a non-# character and the closing quote.
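
To check the pattern against the sample input from the question, here is a minimal sketch using Python's `re` module. Python is an assumption on my part, not something the question specifies, and the variable names are only illustrative.

```python
import re

# Sample input from the question: several href attributes run together.
text = (
    'href="http://www.test.com/cat/1-one_piece_episodes/"'
    'href="http://www.test.com/cat/2-movies_english_subbed/"'
    'href="http://www.test.com/cat/3-english_dubbed/"'
    'href="http://www.exclude.com"'
)

pattern = re.compile(r'href="((?:(?!\bexclude\b)[^"])*)[^#]"')

# group(0) is the whole attribute; group(1) is the URL minus its final
# character, because that character is consumed by [^#].
matches = [m.group(0) for m in pattern.finditer(text)]
for attribute in matches:
    print(attribute)
```

This prints the three test.com attributes and skips the exclude.com one. Note that capture group 1 stops one character short of the full URL, so take the whole match if you need the complete value.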

To allow multiple "forbidden words":

href="((?:(?!\b(?:exclude|this|too)\b)[^"])*)[^#]"
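
If the list of forbidden words changes often, the alternation can be generated rather than typed by hand. A rough sketch in Python (again an assumption, as is the helper name `build_href_pattern`; the sample string mixes URLs from the question):

```python
import re

def build_href_pattern(forbidden_words):
    """Compile the pattern above for a non-empty list of forbidden words."""
    # re.escape guards against words containing regex metacharacters.
    alternation = "|".join(re.escape(word) for word in forbidden_words)
    return re.compile(r'href="((?:(?!\b(?:' + alternation + r')\b)[^"])*)[^#]"')

sample = (
    'href="http://www.test.com/cat/1-one_piece_episodes/"'
    'href="http://www.xxxx.com/cat/3-english_dubbed/"'
    'href="http://www.exclude.com"'
)

multi_pattern = build_href_pattern(["exclude", "xxxx"])
kept = [m.group(0) for m in multi_pattern.finditer(sample)]
print(kept)
```

Only the test.com attribute survives here, since the other two each contain one of the forbidden words as a whole word.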
Tim Pietzcker
2

Your input doesn't look like a valid string (unless you escape the quotes in it), but you can do it without regex too:

string input = "href=\"http://www.test.com/cat/1-one_piece_episodes/\"href=\"http://www.test.com/cat/2-movies_english_subbed/\"href=\"http://www.test.com/cat/3-english_dubbed/\"href=\"http://www.exclude.com\"";

List<string> matches = new List<string>();

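// Split on the literal "href"; RemoveEmptyEntries drops the empty piece before the first one.
foreach (var match in input.Split(new string[] { "href" }, StringSplitOptions.RemoveEmptyEntries)) {
   if (!match.Contains("exclude.com"))
      matches.Add("href" + match);
}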
Mrchief
0

Will this do the job?

href="(?!http://[^/"]+exclude\.com)(.*?)[^#]"
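
A quick way to see what this domain-based lookahead does and doesn't filter, sketched in Python (an assumption, the thread doesn't name an engine). The dot before `com` is escaped here so that it matches only a literal dot, and the three test URLs are made up for illustration:

```python
import re

domain_pattern = re.compile(r'href="(?!http://[^/"]+exclude\.com)(.*?)[^#]"')

urls = (
    'href="http://www.test.com/fish/exclude"'  # "exclude" only in the path
    'href="http://www.exclude.com"'            # excluded host with a subdomain
    'href="http://exclude.com"'                # excluded host, no subdomain
)

kept_by_domain = [m.group(0) for m in domain_pattern.finditer(urls)]
print(kept_by_domain)
```

In this run the first URL is kept (the word only appears in the path) and the second is dropped, but the bare `http://exclude.com` also gets through, because `[^/"]+` needs at least one character before `exclude`. Changing `+` to `*` would likely close that gap, if that's the behaviour you want.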
Bob Vale