Given a webpage, I want to extract every occurrence of a delimited string. I use a regex to achieve that, like this:

Regex Rx = new Regex(before + "(.*?)" + after);

if (o is string)
{
    string s = o as string;
    List<string> results = new List<string>();
    foreach (Match match in Rx.Matches(s))
    {
        results.Add(match.ToString().Replace(before, "").Replace(after, ""));
    }
    return results.ToArray();
}

My input is an HTML string containing this text:

<script type="text/javascript">
            var s1 = new SWFObject("http://hornbunny.com/player.swf","ply","610","480","10","#000000");
            s1.addParam("allowfullscreen","true");
            s1.addParam("allowscriptaccess","always");
                            s1.addParam("flashvars","overlay=http://cdn1.image.somesite.com/thumbs/0/9/e/1/2/09e12f7aeec382bc63a620622ff535b6/09e12f7aeec382bc63a620622ff535b6.flv-3b.jpg&settings=http://somesite.com/playerConfig.php?09e12f7aeec382bc63a620622ff535b6.flv|0");
            s1.write("myAlternativeContent");
        </script>

The result I get is string[] with 0 elements because foreach (Match match in Rx.Matches(s)) loops 0 times.

The regex matches exactly 0 times, even though there is at least 1 occurrence in my document. I tried to extract the string between var s1 = new SWFObject and </script> as delimiters. Neither delimiter contains special characters, so it should not matter that I didn't escape them.

What seems to be wrong with that regex?

Working:

if (o is string)
{
    string s = o as string;
    List<string> results = new List<string>();
    foreach (Match match in Rx.Matches(s))
    {
        results.Add(match.Groups[1].Value);
    }
    return results.ToArray();
}
JDE
  • You need to show the content of `before` and `after` with an example. `.*?` is lazy in nature: it consumes as little as possible before terminating, so it is happy to match 0 characters – rock321987 Jun 18 '16 at 13:58
  • OK, so without the question mark? But I want the regex to be lazy so it gets the smallest possible matches; I don't want my delimiters to appear inside the matched text – JDE Jun 18 '16 at 14:08
  • Use `RegexOptions.Singleline` if your string contains newline characters. – Lucas Trzesniewski Jun 18 '16 at 14:10
  • @JDE you need to provide an example – rock321987 Jun 18 '16 at 14:12
  • of your input and output – rock321987 Jun 18 '16 at 14:13
  • also from what I am seeing you can use lookahead and lookbehind instead of using replace – rock321987 Jun 18 '16 at 14:16
  • Don't forget that your regex is case sensitive by default. If the delimiters in your regex are in a different case than the search text you won't find a match. You can make the regex case insensitive by setting the options on the Regex object. Also, since you have a capture group in your regex you can simplify your results.Add line to this: `results.Add(match.Groups[1].Value);` – Francis Gagnon Jun 18 '16 at 14:18
  • now comes the classical question.. **[`Don't parse HTML with regex`](http://stackoverflow.com/a/1732454/1996394)** – rock321987 Jun 18 '16 at 14:21
  • this is irrelevant, I want just an url out of it – JDE Jun 18 '16 at 14:25

1 Answer

Without the RegexOptions.Singleline option, . matches any character except a newline (\n), so .*? cannot cross a line break. Unless the text between the delimiters is all on one line, the regex will not match.

So we arrive at ((.|\s)*), which matches any character or any whitespace (including newlines) zero or more times. Alternatively, with RegexOptions.Singleline the regex reduces to (.*), or to (.*?) if each match should stop at the first closing delimiter.
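The effect of the option can be checked in isolation. The following is a minimal sketch, not the asker's code: the `START`/`END` delimiters and the input string are hypothetical, chosen only to show how `.` treats a newline with and without `RegexOptions.Singleline`.

```csharp
using System;
using System.Text.RegularExpressions;

class SinglelineDemo
{
    static void Main()
    {
        // Hypothetical delimiters and input, for illustration only.
        string input = "START\nline one\nline two\nEND";
        string pattern = "START(.*?)END";

        // Default options: '.' does not match '\n', so the pattern cannot span lines.
        Console.WriteLine(Regex.IsMatch(input, pattern)); // False

        // Singleline: '.' matches every character, including '\n'.
        Console.WriteLine(Regex.IsMatch(input, pattern, RegexOptions.Singleline)); // True

        // [\s\S] matches any character regardless of the options in effect.
        Console.WriteLine(Regex.IsMatch(input, "START([\\s\\S]*?)END")); // True
    }
}
```

The `[\s\S]` form is a common alternative to `(.|\s)` when passing options is not convenient.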

Edit: Working example.

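var before = "var s1 = new SWFObject";
var after = "</script>";
var input = @"var s1 = new SWFObject(d
aw
da
wd
awd
aw
d
aw
d
awd
        </script> ";
// Regex.Escape makes any regex metacharacters in the delimiters match literally.
// Singleline lets '.' match newlines; the lazy (.*?) stops at the first closing delimiter.
Regex Rx = new Regex(Regex.Escape(before) + "(.*?)" + Regex.Escape(after), RegexOptions.Singleline);

List<string> results = new List<string>();
foreach (Match match in Rx.Matches(input))
{
    // Group 1 holds only the text between the delimiters.
    results.Add(match.Groups[1].Value);
}
// Dump() is a LINQPad extension method; use Console.WriteLine outside LINQPad.
results.ToArray().Dump();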
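Since the question asks for every occurrence, it may help to see how the quantifier behaves when the input contains more than one delimited string. This is a rough sketch with made-up `<b>`/`</b>` delimiters and a made-up input, not taken from the question:

```csharp
using System;
using System.Text.RegularExpressions;

class LazyVsGreedyDemo
{
    static void Main()
    {
        // Hypothetical delimiters and input, for illustration only.
        string before = "<b>";
        string after = "</b>";
        string input = "<b>one</b>\n<b>two</b>";

        string open = Regex.Escape(before);
        string close = Regex.Escape(after);

        // Greedy: runs from the first opening delimiter to the last closing one.
        var greedy = new Regex(open + "(.*)" + close, RegexOptions.Singleline);
        // Lazy: stops at the first closing delimiter after each opening one.
        var lazy = new Regex(open + "(.*?)" + close, RegexOptions.Singleline);

        Console.WriteLine(greedy.Matches(input).Count); // 1
        Console.WriteLine(lazy.Matches(input).Count);   // 2

        foreach (Match m in lazy.Matches(input))
        {
            Console.WriteLine(m.Groups[1].Value); // "one", then "two"
        }
    }
}
```

With a single occurrence in the input the two forms return the same result; they only differ once the closing delimiter appears more than once.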
Nikola Sivkov