1

What is faster having a String and then calling Substring for this String where the first Substring will be the initial String and every new Substring a smaller portion from within the initial substring until its end? Or using a StringBuilder and cutting off more and more from the beginning and then always using ToString from the shrinking string?

This link describes that the StringBuilder ToString method only allocates new space if a new thread accesses the StringBuilder or the returned String is much smaller than the currently allocated space. This sounds faster than using Substring. I can see this logic within the implementation up to .Net Framework 3.5. With version 4.0 there is an immediate call to FastAllocateString (more or less like with String Substring). Is it the same logic as in previous framework versions, only hidden now?

Why I need it:

Edit: Updated explanation: I have a string and a set of user given regexes and a context. The context tells me which regexes would be currently interesting to try to match with the start of the string. After a successful match the start of the string is now after where the last match ended. And now there may be a different context. And additionally I want to know which Regex matched with the start the last time. Since this is done multiple times the performance should be as good as possible.

In order to avoid searching the whole (remaining) string, I would use the start-anchor (^), so the search can be abborted as soon as possible (after knowing that the start does not match). If I do this I have the problem that I cannot use the start index parameter of the Match method with anything other than 0. Otherwise there would never be a match thanks to the anchor.

So I think I have two possibilities now:

1) Always using the Substring method on the input string and keeping and index for the current position.

2) Using a StringBuilder and always removing the start of the input when matched successfuly. Then calling the ToString method to see if the new start matches with the next Regex.

Is the performance of one or both of the above approaches acceptable, or is there another way which is faster?

Edit2: Regarding StringBuilder and String: As described here the implementation of StringBuilder now immediately allocates the space whereas previously this was delayed until the StringBuilder string was changed (which would have been the case all the time for my requirements). So I guess both Substring and ToString are very slow compared to the Regex-only solution from Kendall Frey.

Community
  • 1
  • 1
user764754
  • 3,865
  • 2
  • 39
  • 55

2 Answers2

3

You don't need to do this. The .NET regex engine includes a feature that lets you match a string immediately following the previous match, using the \G anchor.

Example regex:

\Gcat

Example string:

catcatcatdogcat

Matches:

(cat)(cat)(cat)dogcat

Edit: I don't think the regex lets you remember the \G anchor position between multiple regexes. Instead, you can use Regex.Match(string, int) overload to begin your string at a certain point, and use the \G anchor to match the starting position.

Kendall Frey
  • 43,130
  • 20
  • 110
  • 148
  • I think this is helpful when I only have one regex that I'm using repeatedly but I have more. – user764754 Jul 25 '12 at 18:43
  • What if you just added \G to the beginning of the regex dynamically? Where are they coming from? The user? – Kendall Frey Jul 25 '12 at 18:44
  • Yes from the user. I could add \G. But as I understand it I cannot continue with a new Regex from the last match, or am I wrong? I think this would only work in Perl – user764754 Jul 25 '12 at 18:48
  • Oh, you need multiple regexes for one string? What's the point of splitting them then? – Kendall Frey Jul 25 '12 at 18:50
  • I don't quite get what you mean with "splitting them". I have a string and a set of regexes and a context. The context tells me which regexes would be currently interesting to match with the start of the string. After a successful match the start of the string is now after where the last match ended. And now there may be a different context. And additionally I want to know which Regex matched. I think I'll update this explanation in my initial question because it may clarify things. – user764754 Jul 25 '12 at 18:56
  • I updated the answer. You can specify a starting point in the string manually for the \G anchor. – Kendall Frey Jul 25 '12 at 19:12
  • If I specify a start index then the matching wouldn't stop, if there is no match right at the start. How can i use \G to match the starting position and how do I specify a starting point int the string manually? – user764754 Jul 25 '12 at 19:25
  • The first time you perform a match, the starting position is 0. Then when a match is found, instead of cutting away the match like your original idea, you just increase the starting position by the length of the match. If you aren't finding a subsequent match as you would with `Match.NextMatch`, then \G matches at the starting index passed into `Regex.Match`. – Kendall Frey Jul 25 '12 at 19:31
  • Thanks to your explanations I got it to run now in a quick test case. Looks like the ultimate solution to me. Many thanks, much appreciated! – user764754 Jul 25 '12 at 20:25
0

A lot of string manipulation methods have versions with begin/end or begin/length pairs.

I.e. Regex.Match can be used to match substrings which is likely faster than cutting string in parts or inserting sentinel characters.

public Match Match(string input, int beginning, int length)

Note: as with any performance questions you need to try variants and measure yourself for your data.

Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179
  • The problem is that I don't know the length. It could be everything from one character up to the rest of the string. – user764754 Jul 25 '12 at 18:09