0

I have code that relies heavily on the String.Substring method, to the point where the Substring method is slowing down my application. I know that the range I want to retrieve is within the String (It is not out of bounds.) Is there something that I could use instead of Substring that would be faster? Could I write my own Substring method that could forgo any bounds checks?

sample code:

public String get_element(int element_number) {
        int count = 0;
        int start_index = 0;
        int end_index = 0;
        int current_index = 0;

        while (count < element_number && current_index != -1) {
            current_index = line_text.IndexOf(x12_reader.element_delimiter, start_index);
            start_index = current_index + 1;
            count++;
        }

        if (current_index != -1) {
            end_index = line_text.IndexOf(x12_reader.element_delimiter, start_index);
            if (end_index == -1) end_index = line_text.Length;
            return line_text.Substring(start_index, end_index - start_index); ;
        } else {
            return "";
        }
    }

I see lots of comments asking if Substring is really the problem. I know that Substring is the problem. I have run profiling in Visual Studio and It has pointed to Substring as the culprit. Also I cannot call this function any less than I currently am. The only place I have left to optimize is the Substring function. I know this is the case.

Zoran Horvat
  • 10,924
  • 3
  • 31
  • 43
Daniel
  • 27
  • 1
  • 1
  • 6
  • 6
    How did you determine that `Substring` is the bottle neck? – juharr Aug 31 '16 at 20:13
  • 1
    Can you afford storing your data as array instead of stringified DSV. For example can you at start split `line_text` to an array and simply access element at `element_number`? – Yakuza Aug 31 '16 at 20:14
  • 3
    I'd be surprised that the bounds checking is the culprit for your performance issue. – hatchet - done with SOverflow Aug 31 '16 at 20:14
  • 4
    Something makes me think your problem is elsewhere. Have you used a profiler? `Substring` actually uses `unsafe` code to operate at the pointer level. That is going to be WAY faster than anything you write in straight C#. Also, I highly doubt the bounds checking, which equates to a handful of CPU instructions, is the culprit. – dmeglio Aug 31 '16 at 20:14
  • I have used a profiler and it is telling me that Substring is the culprit. – Daniel Aug 31 '16 at 20:18
  • @Yakuza - I can't store the data as an array. I have to access it as a string. – Daniel Aug 31 '16 at 20:19
  • 1
    @dman2306 - a handful of CPU instructions is exactly what I need to shave off. This function is called millions of times. – Daniel Aug 31 '16 at 20:26
  • 1
    The time to create a new string object surely dwarfs by quite a bit the time it takes to do a couple simple int compares. The problem with trying to write a your own SubString method is that you can avoid the bounds checks (which take little time), but don't have access to the optimized internal methods used by SubString to create the new string. – hatchet - done with SOverflow Aug 31 '16 at 20:27
  • @hatchet - is there any way that I can modify the immutable string in an unsafe way so that I don't have to create a new string object? – Daniel Aug 31 '16 at 20:30
  • 1
    the point is that no code is infinitely fast, something has to take all the time. The profiling is telling you that substring is the heaviest piece of this function (which is surely is), I seriously doubt you can write one thats faster. Instead you need to look at your overall design to see if you can be more efficient. – pm100 Aug 31 '16 at 20:31
  • 2
    Hmm it would be good to know wider context, as I have feeling that your performance issue might be solved by using some better data structures adn/or architecture. Can you say what is purpose of this function in your domain? – Yakuza Aug 31 '16 at 20:31
  • 1
    "The only place I have left to optimize is the Substring function. I know this is the case." It seems like you're going into this with a very closed mind. You don't even consider the possibility of rearchitecting. This is a horrible mindset to have when approaching a problem like this. – itsme86 Aug 31 '16 at 20:37
  • If you're calling it millions of times, why don't you do some preprocessing on `line_text`? Maybe index the string or introduce some caching. – itsme86 Aug 31 '16 at 20:44
  • a 'handful of instructions' executed a million times is going to be 1 or 2 milliseconds on a modern processor. Is that the amount of time you are looking to save? – pm100 Aug 31 '16 at 20:47
  • The only thing I can think that would literally get what you're asking for would be using reflection to get access to the internal method String.InternalSubString. It's substring without bounds checking. I personally think that would be a really horrible solution, even if .net allows it to work. Re-architecting seems a much more rational path. – hatchet - done with SOverflow Aug 31 '16 at 20:54

3 Answers3

1

The problem with your function is that you are returning the substring... Therefore, I don't see the way to avoid constructing a new string.

The next question is what exactly are you doing with the result? Maybe you could change the signature of the get_element method, e.g. to receive a StringBuilder and then to copy characters from the target string instead of building the new string.

public void get_element(int element_number, StringBuilder buffer)
{
    ...
    //  instead of: return line_text.Substring(...); 
    buffer.Append(line_text, start_index, end_index - start_index);
}

Anyway, cost of constructing new strings is not too high. Maybe there are other reasons why performance is poor in your case. Maybe you are doing too many concatenations of strings returned by this method?

Zoran Horvat
  • 10,924
  • 3
  • 31
  • 43
  • I have tried using a stringbuilder instead and the performance is slower. I am calling this method millions of times. – Daniel Aug 31 '16 at 20:20
  • `StringBuilder` will lead to improved performance *only* if it is used many times in many calls to `get_element`. Can you add more code to the question? In particular, I would like to see what you are doing at the calling end, i.e. what is happening to the string once it has been returned from the `get_element` method. – Zoran Horvat Aug 31 '16 at 20:22
  • The returned string is simply stored in most cases. To be used later. The overhead of Stringbuilder is too much given the size of the strings returned. – Daniel Aug 31 '16 at 20:25
  • A see, but then again - how do you know that returning the substring is the problem? `Substring` method itself has only O(n) complexity, where n is the substring length, and it is optimized as much as you can get in .NET. – Zoran Horvat Aug 31 '16 at 20:27
  • because I have ran a profiler and it is telling me that Substring is the problem. – Daniel Aug 31 '16 at 20:31
  • 3
    The profiler will only tell you that substring is the "problem" by saying it's being called the most number of times or is taking the most CPU cycles. It cannot tell you whether your code is going to be better by rethinking the algorithm or changing it in some other fundamental way. – DavidG Aug 31 '16 at 20:53
1

I doubt the problem is caused by the bounds checking inside the Substring method.

But if you want to be sure and unsafe code is allowed, you can try the following string constructor:

public unsafe String(
    char* value,
    int startIndex,
    int length
)

like this:

public static class StringUtils
{
    public static unsafe string UnsafeSubstring(this string source, int startIndex, int length)
    {
        fixed (char* chars = source)
            return new string(chars, startIndex, length);
    }
}

Then replace Substring calls with UnsafeSubstring and see if there is any noticeable difference.

Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
  • this is (almost) identical to the normal substring without safety calls. – Dispersia Aug 31 '16 at 21:03
  • 2
    @Dispersia Correct. As I mentioned at the beginning, I doubt the issue is caused by the safety calls. So the above is just to answer OP "*Could I write my own Substring method that could forgo any bounds checks?*" – Ivan Stoev Aug 31 '16 at 21:07
-2

A string is only an array of char. You could loop through the char array between your known bounds and replace the char.

Rick the Scapegoat
  • 1,056
  • 9
  • 19
  • That's not accurate at all. A string is an object, it is not an array of characters. On top of that, a string is an *immutable* object which means you can't modify it without constructing a new object. – dmeglio Aug 31 '16 at 20:19
  • 1
    @dman2306 rick's answer on a string being only an array of char is correct. Internally it is an unsafe class that uses the CLR to hold a char* array. You can do`string test = "test"; unsafe { fixed (char* p = test) { p[0] = 'C'; } }` and that is perfectly legal, and it does not reinstantiate the string, it just uses unsafe. – Dispersia Aug 31 '16 at 20:24
  • You can modify the string from unsafe code but... that is a big no-no in .NET because everything around a string object assumes it will not change. Anyway, the question is about a different thing - here part of the string needs to be transferred elsewhere, rather than modified. – Zoran Horvat Aug 31 '16 at 20:24
  • @ZoranHorvat as long as you call fixed the GC will not hurl, it's like locking for threading, in which as long as you're sure the index exists, it is an ok operation. – Dispersia Aug 31 '16 at 20:26
  • 2
    @Dispersia This discussion is off topic, but anyway let's clarify that. You can modify the string in a way that will not crash. Anyway, different string objects may reuse the same character buffer because internally they know that buffer content cannot change. Now if you do change it, you have potentially changed several other string variables you never knew about. And that behavior would appear non-deterministically. Most notably, string literals are always reused by the compiler and you can still modify them from the unsafe code, causing completely unknown consequences. – Zoran Horvat Aug 31 '16 at 20:30
  • I don't know why this was down voted nor do I understand why using unsafe code is a no-no. While you shouldn't use unsafe to write C in C#, there are situations where unsafe is viable, such as performance critical applications - which is the crux of this question. Just look at this question. http://stackoverflow.com/questions/584134/should-you-use-pointers-unsafe-code-in-c – Rick the Scapegoat Sep 01 '16 at 15:52
  • @rick because people bandwagon and don't do research themselves. – Dispersia Nov 08 '16 at 17:55