93

Is there a better way than this to convert a MatchCollection to a string array?

MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
string[] strArray = new string[mc.Count];
for (int i = 0; i < mc.Count; i++)
{
    strArray[i] = mc[i].Groups[0].Value;
}

P.S.: mc.CopyTo(strArray,0) throws an exception:

At least one element in the source array could not be cast down to the destination array type.

ggorlen
Vildan

6 Answers

201

Try:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
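
For reference, here is a self-contained sketch of the same approach wrapped in a helper. The `MatchValues` name and the sample input are made up for illustration. `Cast<Match>()` is needed on .NET Framework because `MatchCollection` there only implements the non-generic `IEnumerable`. That is also why `CopyTo` into a `string[]` fails: the collection holds `Match` objects, not strings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexHelpers
{
    // Hypothetical helper: returns the text of every match as a string array.
    public static string[] MatchValues(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
            .Cast<Match>()          // MatchCollection -> IEnumerable<Match>
            .Select(m => m.Value)   // m.Value is the same text as m.Groups[0].Value
            .ToArray();
    }
}

static class Program
{
    static void Main()
    {
        string[] words = RegexHelpers.MatchValues("it's a well-known fact", @"\b[A-Za-z-']+\b");
        Console.WriteLine(string.Join(", ", words));
    }
}
```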
Dave Bish
  • I would have used `OfType()` for this instead of `Cast()` ... Then again, the outcome would be the same. – Alex Jul 10 '12 at 15:05
  • @Alex You know that everything returned will be a `Match`, so there's no need to check it again at runtime. `Cast` makes more sense. – Servy Jul 10 '12 at 15:08
  • @DaveBish I posted some sort-of benchmarking code below, `OfType<>` turns out to be slightly faster. – Alex Jul 10 '12 at 15:29
  • @DaveBish: don't worry about OfType vs Cast performance. Your #1 performance dog is `Regex.Matches`. – user7116 Jul 11 '12 at 13:20
  • For future visitors, this has been argued: http://stackoverflow.com/a/11432268/187510 – Letterman Dec 07 '13 at 20:26
  • Is there any reason not to do this: `var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b").Cast<Match>().Select(m => m.Value).ToArray();` Since there aren't any parentheses in the pattern, I don't see a reason to use the Group property unless it doesn't do what I think it does. – Teeknow Mar 05 '14 at 19:14
  • @Frontenderman - Nope, I was just aligning it with the asker's question – Dave Bish Mar 06 '14 at 16:48
  • You would think it would be a simple command to turn a `MatchCollection` into a `string[]`, as it is for `Match.ToString()`. It's pretty obvious the final type needed in a lot of `Regex` uses would be a string, so it should have been easy to convert. – n00dles Jun 10 '17 at 16:19
  • @n00dles I agree, though the first annoying thing is having to deal with a non-generic ICollection and IEnumerable type, though to be totally fair, I'm pretty sure this API was made prior even to generic C# support. – Nicholas Petersen Feb 14 '18 at 18:14
33

Dave Bish's answer is good and works properly.

It's worth noting, though, that replacing Cast<Match>() with OfType<Match>() was faster in my test.

The code would become:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .OfType<Match>()
    .Select(m => m.Groups[0].Value)
    .ToArray();

The result is exactly the same (and it addresses the OP's issue in exactly the same way), but for huge strings it was faster in my test.

Test code:

// put it in a console application
static void Test()
{
    Stopwatch sw = new Stopwatch();
    StringBuilder sb = new StringBuilder();
    string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";

    Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
    strText = sb.ToString();

    sw.Start();
    var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .OfType<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();

    Console.WriteLine("OfType: " + sw.ElapsedMilliseconds.ToString());
    sw.Reset();

    sw.Start();
    var arr2 = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .Cast<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();
    Console.WriteLine("Cast: " + sw.ElapsedMilliseconds.ToString());
}

Output follows:

OfType: 6540
Cast: 8743

In this run, Cast() was therefore slower for very long strings.

Alex
  • Very surprising! Given that OfType must do an 'is' comparison somewhere inside and a cast (I'd have thought?) Any ideas on why Cast<> is slower? I've got nothing! – Dave Bish Jul 11 '12 at 08:51
  • I honestly don't have a clue, but it "feels" right to me (OfType<> is just a filter, Cast<> is ... well, is a cast) – Alex Jul 11 '12 at 09:55
  • More benchmarks seem to show that this particular result is due to the regex more than to the specific LINQ extension used – Alex Jul 11 '12 at 13:14
6

I ran the exact same benchmark that Alex posted and found that sometimes Cast was faster and sometimes OfType was faster, but the difference between the two was negligible. However, while ugly, the for loop was consistently faster than either of them.

Stopwatch sw = new Stopwatch();
StringBuilder sb = new StringBuilder();
string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
strText = sb.ToString();

//First two benchmarks

sw.Start();
MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
var matches = new string[mc.Count];
for (int i = 0; i < matches.Length; i++)
{
    matches[i] = mc[i].ToString();
}
sw.Stop();

Results:

OfType: 3462
Cast: 3499
For: 2650
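
Single-run Stopwatch timings like these are easy to skew: whichever variant runs first also pays for JIT compilation and for constructing the regex. Below is a sketch of a slightly fairer comparison, assuming one untimed warm-up call per variant and an average over several runs. The class and method names, the input size, and the run count are arbitrary choices for illustration, and no particular results are claimed.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class MatchBenchmark
{
    const string Pattern = @"\b[A-Za-z-']+\b";

    public static string[] WithOfType(string text)
    {
        return Regex.Matches(text, Pattern).OfType<Match>().Select(m => m.Value).ToArray();
    }

    public static string[] WithCast(string text)
    {
        return Regex.Matches(text, Pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static string[] WithFor(string text)
    {
        MatchCollection mc = Regex.Matches(text, Pattern);
        var result = new string[mc.Count];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = mc[i].Value;
        }
        return result;
    }

    // Makes one untimed warm-up call, then returns the average
    // time in milliseconds over `runs` timed calls.
    public static double AverageMs(Func<string, string[]> method, string text, int runs)
    {
        method(text); // warm-up: JIT compilation and regex construction
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            method(text);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / runs;
    }

    static void Main()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
        {
            sb.Append("this will become a very long string ");
        }
        string text = sb.ToString();

        Console.WriteLine("OfType: " + AverageMs(WithOfType, text, 5));
        Console.WriteLine("Cast:   " + AverageMs(WithCast, text, 5));
        Console.WriteLine("For:    " + AverageMs(WithFor, text, 5));
    }
}
```

For numbers you intend to rely on, a dedicated benchmarking library is likely a better choice than a hand-rolled Stopwatch loop.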
David DeMar
  • No surprise that LINQ is slower than a for loop. LINQ may be easier to write for some people and "increase" their productivity at the expense of execution time. That can be good sometimes. – gg89 Sep 23 '15 at 06:01
  • So the original post is the most efficient method really. – n00dles Jun 10 '17 at 16:21
4

One could also make use of this extension method to deal with the annoyance of MatchCollection not being generic. Not that it's a big deal, but this is almost certainly more performant than OfType or Cast, because it's just enumerating, which both of those also have to do.

(Side note: I wonder if it would be possible for the .NET team to make MatchCollection inherit generic versions of ICollection and IEnumerable in the future? Then we wouldn't need this extra step to immediately have LINQ transforms available).

public static IEnumerable<Match> ToEnumerable(this MatchCollection mc)
{
    if (mc != null) {
        foreach (Match m in mc)
            yield return m;
    }
}
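
A usage sketch, assuming the extension method above lives in a static class (the class name `MatchCollectionExtensions` and the sample input are made up for illustration). As far as I know, on .NET Core 2.0 and later `MatchCollection` already implements `IEnumerable<Match>`, so there `Select` can be called on it directly and the helper is only needed on older frameworks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchCollectionExtensions
{
    // Same idea as the extension method above: expose the matches as
    // IEnumerable<Match>; a null collection yields an empty sequence.
    public static IEnumerable<Match> ToEnumerable(this MatchCollection mc)
    {
        if (mc != null)
        {
            foreach (Match m in mc)
                yield return m;
        }
    }
}

static class Program
{
    static void Main()
    {
        string[] words = Regex.Matches("one two three", @"\b[A-Za-z-']+\b")
            .ToEnumerable()
            .Select(m => m.Value)
            .ToArray();
        Console.WriteLine(string.Join(", ", words));
    }
}
```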
Lauren Rutledge
Nicholas Petersen
0

Consider the following code...

var emailAddress = "joe@sad.com; joe@happy.com; joe@elated.com";
List<string> emails = Regex.Matches(emailAddress, @"([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                .Cast<Match>()
                .Select(m => m.Groups[0].Value)
                .ToList();
Peter Mortensen
gpmurthy
  • ugh... That regex is horrendous to look at. BTW, as there doesn't exist a foolproof regex for validating emails, use the MailAddress object. http://stackoverflow.com/a/201378/2437521 – C. Tewalt Aug 05 '14 at 18:19
0

If you need the repeated captures of a single group, e.g. for tokenizing math equations:

// Input (I need this tokenized to do math)
string sTests = "(1234+5678)/ (56.78-   1234   )";

// The group is repeated, so each token is stored as a separate Capture of that group
Regex splitter = new Regex(@"([\d,\.]+|\D)+");
Match match = splitter.Match(sTests.Replace(" ", ""));
string[] captures = (from capture in match.Groups.Cast<Group>().Last().Captures.Cast<Capture>()
                     select capture.Value).ToArray();

...because you need to go after the last captures in the last group.

mike