39

I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1mb) string. Whatever I have tried it seems that String.Replace is the fastest way of doing it.

I just care about the fastest possible way. Not code readability, maintainability etc. I don't care if I need to use unsafe code or pre-process the original string either.

Each replace iteration will replace ABC on the string with some other string (different per replace iteration). The string to replace will ALWAYS be the same - ABC will always be ABC. Never ABD. So if there are 400.000 thousands replace iterations. The same string - ABC - will be replaced with some other (different) string each time.

I can be in control of what ABC is. I can make it super-short or super-long as long as it doesn't affect the results. Clearly ABC can't be hello cause hello will exist as a word in most of the input strings.

Example input: ABCDABCABCDABCABCDABCABCDABCD

Example replace from string: BC

Example replace with strings: AA, BB, CC, DD, EE (5 iterations)

Example outputs:

AAADAAAAAADAAAAAADAAAAAADAAAD
ABBDABBABBDABBABBDABBABBDABBD
ACCDACCACCDACCACCDACCACCDACCD
ADDDADDADDDADDADDDADDADDDADDD
AEEDAEEAEEDAEEAEEDAEEAEEDAEED

Average case: Input string is 100-200kb with 40.000 replace iterations. Worst case: Input string is 1-2mb with 400.000 replace iterations.

I can do ANYTHING. Do it in parallel, do it unsafe, etc. It doesn't matter how I do it. What matters is that it needs to be as fast as it gets.

Null
  • 1,950
  • 9
  • 30
  • 33
Yannis
  • 6,047
  • 5
  • 43
  • 62
  • 4
    You could look at handling chunks at a time, but String.Replace would be the way to go. Can you give us more context? – liquidsnake786 Nov 26 '13 at 15:15
  • What would you like to know? Imagine a big string. I can be in control of the format/length of sub-strings to replace (e.g. #abc# instead of abc or 123 etc). But this is an operation that will happen thousands of times on the same string with the same keys but different replacement values. – Yannis Nov 26 '13 at 15:21
  • Give us an example with input and expected output. – Ahmed KRAIEM Nov 26 '13 at 15:35
  • can you alter the data structure holding the text? can you use parallel threads? – Leeor Nov 26 '13 at 15:37
  • 2
    since it is an optimization issue, please explain what is the average and worst input string, and what is the average and worst substrings. If it's only about replacing a 1 or 2 char long string, the issue will be very different from replacing very long strings. – GameAlchemist Nov 26 '13 at 15:38
  • 1
    *Whatever I have tried it seems that String.Replace is the fastest way of doing it.* It sounds like `String.Replace` is the fastest way of doing it.... – ta.speot.is Nov 29 '13 at 10:35
  • What are you doing with the strings once they've been generated? For optimization, you always have to look at the whole system, not just a part of it. – dtb Nov 29 '13 at 12:06
  • @Yannis, how fast do you need it to be? – SQB Nov 29 '13 at 13:12
  • Also, the [regex] tag is a red herring, I'm removing it. – SQB Nov 29 '13 at 13:20
  • 1
    "I just care about the fastest possible way. Not code readability, maintainability etc. I dont care if I need to use unsafe code or pre-process the original string either." You could spend the rest of your life making a better and better algorithm. There is no fastest way to do anything, how fast is it now? And how fast do you need it to be? – MrFox Nov 29 '13 at 13:27
  • 3
    Will the replacing string always be as long as the replaced string? That would save tons of time creating arrays and coping data. Also it could help if you have many texts that need to be replaced in the same source? Then the search machine could search for all of them at the same time, that could replace all in one iteration. – MrFox Nov 29 '13 at 13:43
  • @dtb other components have already been heavily optimized. The String.Replace is where I spend sometihng like 80% of my time in the application. – Yannis Nov 29 '13 at 15:28
  • @MrFox As mentioned in another comment. The String.Replace is where I spend sometihng like 80% of my time in the application. You understand that if I lets say make it faster by 20% then the whole program becomes 15-16% faster. You get the idea. – Yannis Nov 29 '13 at 15:30
  • @Yannis Yea that comment was a bit whiny, could you comment on the operation properties I asked about later on? Those could make some big optimizations possible. – MrFox Nov 29 '13 at 15:50
  • 3
    Do you definitely need the result as a string, rather than (say) a char array? If you can decide the delimiter beforehand, and it will never occur in the real string, do you really need the string in one chunk? It feels like you've naturally just got chunks. It would really help if you'd describe more of the *purpose* of this... – Jon Skeet Nov 30 '13 at 09:17
  • 3
    Is this an attempt to change a DNA? :) – Larry Nov 30 '13 at 18:22
  • 1
    Do you want to keep all of the results in memory? Looking at your worst case scenario: 2Mb with 400 000 replacements means 800 000Mb (781.25Gb) worth of data in memory for the results – flipchart Dec 02 '13 at 10:42
  • Also, if you're truly interested in performance, would you be apposed to solutions in other languages? This would be a great task for CUDA, although I'm not sure it would actually be faster with all the marshalling back and forth between managed and native – flipchart Dec 02 '13 at 11:31
  • @Laurent - No. Although the example makes it seem like it is. Its a general question about templating a long string. – Yannis Dec 02 '13 at 13:11
  • @flipchart - Results aren't stored in memory. In fact, after some replacements they get persisted. That bit is already optimized. So no - I dont have to neither I do keep all results in memory. – Yannis Dec 02 '13 at 13:12
  • @JonSkeet - I have tried the char array and essentially doing a char[].replace (in a way). I dont think thats the faster option though. But I agree that the overhead of using strings is not necessary here. – Yannis Dec 02 '13 at 13:14
  • Has someone tried to use a lex/yacc approach in this case ? – Larry Dec 02 '13 at 17:54
  • It *does* look like DNA modelling: exactly 4 bases, very long strings, rule based substitutions... the only other bell it rings is one time xor pads. Hey, you aren't using steganography to hide xor pads in what otherwise appears to be DNA data? It's intriguing that you're so concerned with speed; of all the applications I can imagine, only cryptography is so time sensitive. – Peter Wone Dec 03 '13 at 13:10
  • @PeterWone - Why would I lie that its not DNA? Its not like I have my DNA sequence in the example above and you were planning to clone me or something. I m concerned about speed cause the application in discussion spends 80% of the time (after all other optimizations) replacing sub-strings of strings. Also, I never said that I have 4 bases. If you read the detailed description above you have thousands of replace operations. – Yannis Dec 03 '13 at 16:23
  • @yannis - I didn't say that and I didn't insinuate anything of the sort. I'm sorry you got that impression. – Peter Wone Dec 05 '13 at 02:06
  • 1
    Almost certainly, the fastest way would be an unsafe C++ string Transducer based on either Boyer-Moore or KMP searching. – RBarryYoung Dec 05 '13 at 03:00

7 Answers7

4

Using unsafe and compiled as x64

result:

Implementation       | Exec   | GC
#1 Simple            | 4706ms |  0ms
#2 Simple parallel   | 2265ms |  0ms
#3 ParallelSubstring |  800ms | 21ms
#4 Fredou unsafe     |  432ms | 15ms

take the code of Erti-Chris Eelmaa and replace my previous one with this.

I don't think I will do another iteration but i did learn a few thing with unsafe which is a good thing :-)

    private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)
    {
        var indexes = new List<int>();

        //input = "ABCDABCABCDABCABCDABCABCDABCD";
        //inputLength = input.Length;
        //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };

        //my own string.indexof to save a few ms
        int len = inputLength;

        fixed (char* i = input, r = replace)
        {
            int replaceValAsInt = *((int*)r);

            while (--len > -1)
            {
                if (replaceValAsInt == *((int*)&i[len]))
                {
                    indexes.Add(len--);
                }
            }                
        }

        var idx = indexes.ToArray();
        len = indexes.Count;

        Parallel.For(0, replaceBy.Length, l =>
            Process(input, inputLength, replaceBy[l], idx, len)
        );
    }

    private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)
    {
        var output = new char[len];

        fixed (char* o = output, i = input, r = replaceBy)
        {
            int replaceByValAsInt = *((int*)r);

            //direct copy, simulate string.copy
            while (--len > -1)
            {
                o[len] = i[len];
            }

            while (--idxLen > -1)
            {
                ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;
            }
        }

        //Console.WriteLine(output);
    }
Fredou
  • 19,848
  • 10
  • 58
  • 113
  • im going to look at unsafe later today :-) – Fredou Nov 30 '13 at 18:25
  • about 33% faster with unsafe – Fredou Dec 01 '13 at 02:57
  • That's nice. I see you're doing some things, such as declaring variables outside loop, so they wouldn't get repeated(such as by/inputLength) - does this actually win you some milliseconds? I am wondering because one would expect C# compiler do the magic, but what I've seen, it really doesn't do such simple things. I can only suggest you to implement substring parallelization too, and see if you can get it even faster than 250ms or so. My #3d solution focuses too much time in GC, I will try to declare them "globally" as you do(output) – Erti-Chris Eelmaa Dec 01 '13 at 11:48
  • @Erti-ChrisEelmaa you need to look at the il produced to know why i'm moving the variable inside the parallel loop – Fredou Dec 01 '13 at 18:25
  • This is a VERY good solution @Fredou - Will keep you posted! Quick q for now. Why are you iterating 5 times in your solution (where you do the parallel.for)? – Yannis Dec 02 '13 at 13:17
  • @Fredou your solution is very good but unfortunately it doesnt work for variable length replacements. For instance "ABCDEFG".Replace("AB", "FREDOU") == false – Yannis Dec 02 '13 at 13:45
  • @Yannis, it is "hardcoded" to be a 2 chars replacement only, i will see if i can make it variable – Fredou Dec 02 '13 at 13:50
  • @Yannis, since you want to expand the string, I don't think the unsafe way could easily work since you have to work with fixed array size. sorry. – Fredou Dec 02 '13 at 15:06
  • @Yannis, I do the loop 5 time to make sure the speed is consistent :-) – Fredou Dec 02 '13 at 15:20
  • 1
    @Fredou; I have updated my post. I've upgraded it to 1MB string now. Note that you have to measure GC too. That means measuring the "output = new string[]" part, otherwise you're just cheating :P. If we are dealing with OP data(2MB string + 500 replacements), your solution would run out of memory, since you allocate everything at once(that is - on a 32-bit system). – Erti-Chris Eelmaa Dec 02 '13 at 16:59
  • @Erti-ChrisEelmaa, oops yeah your right, in one of my revision i think i moved the string.copy out of the stopwatch by mistake. im going to change(removing it in fact) it in a few minutes because this make no sense at all! thanks – Fredou Dec 02 '13 at 18:09
  • in fact back in a few hour :-) – Fredou Dec 02 '13 at 18:32
  • @Erti-ChrisEelmaa, took a while but i was able to fix my mistake :-) i stole your idea of doing indexof before doing the big chunk of work – Fredou Dec 03 '13 at 00:40
  • Fredou, @JulianR - Erti-ChrisElmaas implementation(s) are more robust cause they allow variable size replace string (e.g. replace ABC with SOMESTRING and QWERTY with QRT). So i dont think performance comparisons between the three solutions make much sense (despite the excellent implementations) – Yannis Dec 03 '13 at 16:40
  • @Yannis, since i did some major refactoring since my previous comment about expanding under unsafe, I think now I can make this work. not sure if i have time right now to make the change so i will update this in a few hours – Fredou Dec 03 '13 at 16:49
  • @Fredou - Just to clarify. I dont want you to fix it :) I am just looking for the general idea here. I m just saying that comparing the results is a bit unfair but I certainly get the point. – Yannis Dec 03 '13 at 19:16
  • @Yannis I like the challenge so I'm also doing it for myself :-) – Fredou Dec 03 '13 at 19:29
  • @Yannis, i was able to implement a variable size find/replace but it wasn't worthed compared to Erti code so I just updated my code with a bit faster 2 chars replacement – Fredou Dec 04 '13 at 15:53
  • Indeed my friends, it has been nice learning process for me too. I've added my unsafe implementation too. I see another 100ms that can be won with great tricks, but not gonna do it for now :-) – Erti-Chris Eelmaa Dec 04 '13 at 17:59
  • @Erti-ChrisEelmaa i'm going to have to look at the unmanaged code later today :-) – Fredou Dec 04 '13 at 19:35
2

I had a similar issue on a project and I've implemented a Regex solution to perform multiple and case insensitive replacements on a file.

For efficiency purposes, I set criteria to go through the original string only once.

I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements

The code for the Regex solution is similar to this:

Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    // Fill the dictionary with the proper replacements:

        StringBuilder patternBuilder = new StringBuilder();
                patternBuilder.Append('(');

                bool firstReplacement = true;

                foreach (var replacement in replacements.Keys)
                {
                    if (!firstReplacement)
                        patternBuilder.Append('|');
                    else
                        firstReplacement = false;

                    patternBuilder.Append('(');
                    patternBuilder.Append(Regex.Escape(replacement));
                    patternBuilder.Append(')');
                }
                patternBuilder.Append(')');

                var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);

                return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value]));

EDIT: The execution times running the test application on my computer are:

  • Looping through the replacements calling string.Substring() (CASE SENSITIVE): 2ms
  • Single pass using Regex with multiple replacements at once (Case insensitive): 8ms
  • Looping through replacements using a ReplaceIgnoreCase Extension (Case insensitive): 55ms
Community
  • 1
  • 1
nunoc
  • 31
  • 2
1

It sounds like you are tokenising the string? I would look at producing a buffer and indexing your tokens. Or using a templating engine

As a naive example you could use code generation to make the following method

public string Produce(string tokenValue){

    var builder = new StringBuilder();
    builder.Append("A");
    builder.Append(tokenValue);
    builder.Append("D");

    return builder.ToString();

}

If your running the iterations enough times, the time to build the template will pay for itself. You can then also call that method in parallel with no side effects. Also look at interning your strings

Adam Mills
  • 7,719
  • 3
  • 31
  • 47
1

You probably won't get anything faster than String.Replace (unless you go native) because iirc String.Replace is implemented in CLR itself for maximum performance. If you want 100% performance, you can conveniently interface with native ASM code via C++/CLI and go from there.

Nemo
  • 129
  • 7
  • It seems (at least in .NET 6) String.Replace is not implemented (any more?) in native, but in managed code: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs – David Liebeherr Aug 22 '22 at 00:10
1

My approach is a little like templating - it takes the input string and pulls out (removes) the substrings that are to be replaced. Then it takes the remaining parts of the string (the template) and combines them with the new replacement substrings. This is done in a parallel operation (template + each replacement string), which builds the output strings.

I think what I am explaining above may be clearer with code. This uses your sample inputs from above:

const char splitter = '\t';   // use a char that will not appear in your string

string input = "ABCDABCABCDABCABCDABCABCDABCD";
string oldString = "BC";
string[] newStrings = { "AA", "BB", "CC", "DD", "EE" };

// In input, replace oldString with tabs, so that we can do String.Split later
var inputTabbed = input.Replace(oldString, splitter.ToString());
// ABCDABCABCDABCABCDABCABCDABCD --> A\tDA\tA\tDA\tA\tDA\tA\tDA\tD

var inputs = inputTabbed.Split(splitter);
/* inputs (the template) now contains:
[0] "A" 
[1] "DA"
[2] "A" 
[3] "DA"
[4] "A" 
[5] "DA"
[6] "A" 
[7] "DA"
[8] "D" 
*/

// In parallel, build the output using the template (inputs)
// and the replacement strings (newStrings)
var outputs = new List<string>();
Parallel.ForEach(newStrings, iteration =>
    {
        var output = string.Join(iteration, inputs);
        // only lock the list operation
        lock (outputs) { outputs.Add(output); }
    });

foreach (var output in outputs)
    Console.WriteLine(output);

Output:

AAADAAAAAADAAAAAADAAAAAADAAAD
ABBDABBABBDABBABBDABBABBDABBD
ACCDACCACCDACCACCDACCACCDACCD
ADDDADDADDDADDADDDADDADDDADDD
AEEDAEEAEEDAEEAEEDAEEAEEDAEED

So you can do a comparison, here is a complete method which can be used in the test code by Erti-Chris Eelmaa:

private static void TemplatingImp(string input, string replaceWhat, IEnumerable<string> replaceIterations)
{
    const char splitter = '\t';   // use a char that will not appear in your string

    var inputTabbed = input.Replace(replaceWhat, splitter.ToString());
    var inputs = inputTabbed.Split(splitter);

    // In parallel, build the output using the split parts (inputs)
    // and the replacement strings (newStrings)
    //var outputs = new List<string>();
    Parallel.ForEach(replaceIterations, iteration =>
    {
        var output = string.Join(iteration, inputs);
    });
}
chue x
  • 18,573
  • 7
  • 56
  • 70
1

I made a variation on Fredou's code that requires less compares as it works on int* instead of char*. It still requires n iterations for a string of n length, it just has to do less comparing. You could have n/2 iterations if the string is neatly aligned by 2 (so the string to replace can only occur at indexes 0, 2, 4, 6, 8, etc) or even n/4 if it's aligned by 4 (you'd use long*). I'm not very good at bit fiddling like this, so someone might be able to find some obvious flaw in my code that could be more efficient. I verified that the result of my variation is the same as that of the simple string.Replace.

Additionally, I expect that some gains could be made in the 500x string.Copy that it does, but haven't looked into that yet.

My results (Fredou II):

IMPLEMENTATION       |  EXEC MS | GC MS
#1 Simple            |     6816 |     0
#2 Simple parallel   |     4202 |     0
#3 ParallelSubstring |    27839 |     4
#4 Fredou I          |     2103 |   106
#5 Fredou II         |     1334 |    91

So about 2/3 of the time (x86, but x64 was about the same).

For this code:

private unsafe struct TwoCharStringChunk
{
  public fixed char chars[2];
}

private unsafe static void FredouImplementation_Variation1(string input, int inputLength, string replace, TwoCharStringChunk[] replaceBy)
{
  var output = new string[replaceBy.Length];

  for (var i = 0; i < replaceBy.Length; ++i)
    output[i] = string.Copy(input);

  var r = new TwoCharStringChunk();
  r.chars[0] = replace[0];
  r.chars[1] = replace[1];

  _staticToReplace = r;

  Parallel.For(0, replaceBy.Length, l => Process_Variation1(output[l], input, inputLength, replaceBy[l]));
}

private static TwoCharStringChunk _staticToReplace ;

private static unsafe void Process_Variation1(string output, string input, int len, TwoCharStringChunk replaceBy)
{
  int n = 0;
  int m = len - 1;

  fixed (char* i = input, o = output, chars = _staticToReplace .chars)
  {
    var replaceValAsInt = *((int*)chars);
    var replaceByValAsInt = *((int*)replaceBy.chars);

    while (n < m)
    {
      var compareInput = *((int*)&i[n]);

      if (compareInput == replaceValAsInt)
      {
        ((int*)&o[n])[0] = replaceByValAsInt;
        n += 2;
      }
      else
      {
        ++n;
      }
    }
  }
}

The struct with the fixed buffer is not strictly necessary here and could have been replaced with a simple int field, but expand the char[2] to char[3] and this code can be made to work with three letter strings as well, which wouldn't be possible if it was an int field.

It required some changes to the Program.cs as well, so here's the full gist:

https://gist.github.com/JulianR/7763857

EDIT: I'm not sure why my ParallelSubstring is so slow. I'm running .NET 4 in Release mode, no debugger, in either x86 or x64.

JulianR
  • 16,213
  • 5
  • 55
  • 85
  • you should adapt your solution with my updated code. i don't use string.copy anymore which is freaking slow :-) I like the int32 approach. I'm still learning how to do unsafe code :-) – Fredou Dec 03 '13 at 15:16
  • i did a quick test on my side and with a fixed struct of 2 char with my current solution, it save about 2-4% on runtime which seem good. i will let you do the real thing :-) – Fredou Dec 03 '13 at 16:11
  • Can you fire up profiler and see why Parallel substring is so slow :P? Believe it or not, my solution is still fastest one on my computer. How many cores do you have? – Erti-Chris Eelmaa Dec 03 '13 at 17:27
  • I have two cores at home, but my/Fredou's implementation is multicore as well so it should scale similarly. – JulianR Dec 03 '13 at 18:17
0

As your input string can be as long as 2Mb, I don't foresee any memory allocation problem. You can load everything in memory and replace your data.

If from BC you ALWAYS needs to replace for AA, a String.Replace will be ok. But, if you need more control, you could use a Regex.Replace:

var input  = "ABCDABCABCDABCABCDABCABCDABCD";
var output = Regex.Replace(input, "BC", (match) =>
{
    // here you can add some juice, like counters, etc
    return "AA";
});
Rubens Farias
  • 57,174
  • 8
  • 131
  • 162
  • I have tried that and it's actually slower than string.Replace(..) – Yannis Nov 26 '13 at 16:27
  • Did you try to compile the regex? – Bidou Nov 29 '13 at 10:32
  • `Regex` is not always fast. But is much more robust for strange cases. – Vitor Canova Nov 29 '13 at 11:51
  • 2
    @VitorCanova: What exactly do you mean by "much more robust"? Both `string.Replace` and `Regex.Replace` are good at doing what they're meant to do. If you're trying to replace text based on a pattern, then a regular expression is almost certainly the right approach - but if you're not using a pattern, `Replace` should be fine. What sort of case are you thinking about? – Jon Skeet Nov 30 '13 at 09:18
  • Sorry, I think a made a mistake when I took the word "robust". What I really meant is exactly what you said. It is for more complex things. Just like `StringBuilder` is not the best choice when making concatenation of two little strings but for a bunch of strings with no known size and repetition it is the best. ;) – Vitor Canova Nov 30 '13 at 17:12