I'm pretty new to regex. I already use it successfully in my project, but only to find one specific match.

Now I'm trying to find all matches of a certain URL pattern inside an HTML source.

The urls are like this:

Link example 1: https://clips.twitch.tv/KindYummyCarrotPeteZaroll?tt_content=video_thumbnail

Link example 2: https://clips.twitch.tv/AmericanOilyMeerkatSaltBae?tt_content=video_thumbnail

I have this code searching for the links:

        MatchCollection matches = Regex.Matches(source, @"^(https://clips.twitch.tv/)+(.*?)+(video_thumbnail)$");

        if (matches.Count <= 0)
        {
            MessageBox.Show(matches.Count.ToString() + " urls found");
        }
        else
        {
            MessageBox.Show(matches.Count.ToString() + " urls");
        }

My first instinct was that the source string was somehow wrong, so I tried the regex on this test string:

string source = (" adsfgsdfg adsfg assdfg https://clips.twitch.tv/KindYummyCarrotPeteZaroll?tt_content=video_thumbnail dfgsdfgszdfg asdfg https://clips.twitch.tv/AmericanOilyMeerkatSaltBae?tt_content=video_thumbnailsadfgdf g");

I have tried also this regex:

Regex.Matches(source, @"^(https://clips.twitch.tv/)+([a-z0-9A-Z]{1,100})+(\?)+(tt_content=video_thumbnail)$");

But the result is always 0 URLs found.

What am I doing wrong?

Daniel A. White
Ruben V.
  • Try taking out the `^` and `$`, and wrapping the entire pattern in parentheses. – gunr2171 Jun 07 '18 at 19:30
  • @gunr2171 Why do you suggest *wrapping the entire pattern in parentheses*? There is no need here, as OP is *extracting* the text. Unless you are splitting, but it is not the case. – Wiktor Stribiżew Jun 07 '18 at 19:35
  • @WiktorStribiżew, you're right. I'm used to using a capture group for the text I want to extract, which in this case would be the entire pattern. But the resulting object will give you that information without the need for the capture group. – gunr2171 Jun 07 '18 at 19:43
  • @gunr2171 Thanks a lot, I didn't know `^` and `$` were there to make the whole string match the regex. – Ruben V. Jun 07 '18 at 19:47

1 Answer

Your regex pattern has unescaped characters in it. The `.` has a special meaning in regex (it matches any character except a newline), so to match a literal period you must put a backslash before it. Try this:

(https://clips\.twitch\.tv/)(?:(?!http).)*?(video_thumbnail)

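Note also that the `^` and `$` are gone. They anchor the pattern to the start and end of the input (or of a line, with RegexOptions.Multiline), so with them the pattern only matches when the entire string is a single URL. That is why you got 0 matches on a longer source.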
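As a minimal sketch of the difference, the snippet below runs both the anchored pattern from the question and the unanchored pattern above against the test string from the question. The class and method names (`ClipLinkDemo`, `FindClipLinks`) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClipLinkDemo
{
    // Pattern from the question: anchored to the start and end of the input.
    public const string AnchoredPattern =
        @"^(https://clips.twitch.tv/)+(.*?)+(video_thumbnail)$";

    // Pattern from this answer: literal dots escaped, no ^ or $ anchors.
    public const string Pattern =
        @"(https://clips\.twitch\.tv/)(?:(?!http).)*?(video_thumbnail)";

    // Hypothetical helper: returns every clip link found in the source text.
    public static MatchCollection FindClipLinks(string source)
    {
        return Regex.Matches(source, Pattern);
    }

    public static void Main()
    {
        // Test string from the question.
        string source = " adsfgsdfg adsfg assdfg https://clips.twitch.tv/KindYummyCarrotPeteZaroll?tt_content=video_thumbnail dfgsdfgszdfg asdfg https://clips.twitch.tv/AmericanOilyMeerkatSaltBae?tt_content=video_thumbnailsadfgdf g";

        // The anchored pattern cannot match: the input does not start with "https".
        Console.WriteLine(Regex.Matches(source, AnchoredPattern).Count + " urls (anchored)");

        // The unanchored pattern is tried at every position in the input.
        MatchCollection matches = FindClipLinks(source);
        Console.WriteLine(matches.Count + " urls (unanchored)");
        foreach (Match m in matches)
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

On the question's test string this should report 0 matches for the anchored pattern and 2 for the unanchored one. `Match.Value` already holds the whole URL, so no extra capture group around the pattern is needed.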

N.D.C.
  • `/` should not be escaped in .NET regex, it is not a special character in any regex, BTW. It may be used as a delimiter, but .NET regex do not use delimiters. – Wiktor Stribiżew Jun 07 '18 at 19:31
  • @WiktorStribiżew The regex parser I use (regex101.com) throws an error if `/` is not escaped. I should double check to see if C# requires this. – N.D.C. Jun 07 '18 at 19:32
  • regex101.com does not support .NET regex, why mention it? And it [supports unescaped `/`](https://regex101.com/r/iH8GCx/1). – Wiktor Stribiżew Jun 07 '18 at 19:33
  • @WiktorStribiżew you are right. I used the site out of habit and forgot about the difference. I have edited the post. – N.D.C. Jun 07 '18 at 19:34
  • Well, mind that `(...)+` repeats the whole grouping construct pattern 1 or more times, and `(.*?)+` makes too little sense. Note that `.` matches any char but a newline, so the pattern may overmatch (i.e. match several `https` links on a line, starting with the first and ending with the one that has `video_thumbnail`, matching all other URLs on its way). You might at least change `(.*?)+` into `(?:(?!http).)*?`. But it is not really clear here. – Wiktor Stribiżew Jun 07 '18 at 19:38
  • @WiktorStribiżew thank you for the suggestion, you raise a lot of good points here and I've added your change. My testing shows that it's made it less likely to put two links together, as you suspected. – N.D.C. Jun 07 '18 at 19:45
  • Taking out the `^` and `$` worked, I didn't know they made the regex match the entire string! – Ruben V. Jun 07 '18 at 19:46
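
The over-matching concern raised in the comments can be illustrated with a small sketch. The input below is invented for the demonstration: the first link (`SomeOtherClip?tt_content=chat`) is a hypothetical URL that does not end in `video_thumbnail`, and the names `OvermatchDemo` and `FirstMatch` are likewise made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OvermatchDemo
{
    // Plain lazy dot: free to run across a second "https" link.
    public const string Lazy =
        @"https://clips\.twitch\.tv/.*?video_thumbnail";

    // Tempered version from the comments: each consumed character
    // must not be the start of another "http".
    public const string Tempered =
        @"https://clips\.twitch\.tv/(?:(?!http).)*?video_thumbnail";

    // Hypothetical helper: first match of pattern in input, or "" if none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Invented input: the first link has a different tt_content value.
        string input = "https://clips.twitch.tv/SomeOtherClip?tt_content=chat "
                     + "https://clips.twitch.tv/AmericanOilyMeerkatSaltBae?tt_content=video_thumbnail";

        // Lazy dot: one match that starts at the first link and swallows the second.
        Console.WriteLine(FirstMatch(input, Lazy));

        // Tempered dot: only the link that really ends in video_thumbnail.
        Console.WriteLine(FirstMatch(input, Tempered));
    }
}
```

With the lazy dot, the single match starts at the first link and runs through to the end of the second. The tempered pattern gives up on the first link as soon as it reaches the next `http` and then matches the second link on its own.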