0

I have before me the problem of searching for 90,000 GUIDs in a string. I need to get all instances of each GUID. Technically, I need to replace them too, but that's another story.

Currently, I'm searching for each one individually, using a regex. But it occurred to me that I could get better performance by searching for them all together. I've read about tries in the past and have never used one but it occurred to me that I could construct a trie of all 90,000 GUIDs and use that to search.

Alternatively, perhaps there is an existing library in .NET that can do this. It crossed my mind that there is no reason why I shouldn't be able to get good performance with just a giant regex, but this appears not to work.

Beyond that, perhaps there is some clever trick I could use relating to GUID structure to get better results.

This isn't really a crucial problem for me, but I thought I might be able to learn something.

reveazure
  • 653
  • 2
  • 7
  • 14

4 Answers4

1

I developed a method for replacing a large number of strings a while back, that may be useful:

A better way to replace many strings - obfuscation in C#

Another option would be to use a regular expression to find all GUIDs in the string, then loop through them and check if each is part of your set of GUIDs.

Basic example, using a Dictionary for fast lookup of the GUIDs:

Dictionary<string, string> guids = new Dictionary<string, string>();
guids.Add("3f74a071-54fc-10de-0476-a6b991f0be76", "(replacement)");

string text = "asdf 3f74a071-54fc-10de-0476-a6b991f0be76 lkaq2hlqwer";

text = Regex.Replace(text, @"[\da-f]{8}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12}", m => {
  string replacement;
  if (guids.TryGetValue(m.Value, out replacement)) {
    return replacement;
  } else {
    return m.Value;
  }
});

Console.WriteLine(text);

Output:

asdf (replacement) lkaq2hlqwer
Community
  • 1
  • 1
Guffa
  • 687,336
  • 108
  • 737
  • 1,005
1

You will not get good performance with RegEx because it's performance is inherently poor. Additionally, if all the GUID's share the same format you should only need one RegEx. and regex.Replace(input, replacement); would do it.

If you have the list of guids in memory already the performance would be better by looping over that list and doing calling String.Replace like so

 foreach(string guid in guids)
     inputString.replace(guid, replacement);
evanmcdonnal
  • 46,131
  • 16
  • 104
  • 115
1

Have a look at the Rabin-Karp string search algorithm. It's well suited for multi-pattern searching in a string:

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm#Rabin.E2.80.93Karp_and_multiple_pattern_search

stmax
  • 6,506
  • 4
  • 28
  • 45
0

OK, this looks good. So to be clear here is the original code, which took 65s to run on the example string:

   var unusedGuids = new HashSet<Guid>(oldToNewGuid.Keys);

    foreach (var guid in oldToNewGuid) {
        var regex = guid.Key.ToString();

        if (!Regex.IsMatch(xml, regex))
             unusedGuids.Add(guid.Key);
        else
            xml = Regex.Replace(xml, regex, guid.Value.ToString());
    }

The new code is as follows and takes 6.7s:

var unusedGuids = new HashSet<Guid>(oldToNewGuid.Keys);

var guidHashes = new MultiValueDictionary<int, Guid>();

foreach (var guid in oldToNewGuid.Keys) {
    guidHashes.Add(guid.ToString().GetHashCode(), guid);
}

var indices = new List<Tuple<int, Guid>>();

const int guidLength = 36;

for (int i = 0; i < xml.Length - guidLength; i++) {
    var substring = xml.Substring(i, guidLength);

    foreach (var value in guidHashes.GetValues(substring.GetHashCode())) {
         if (value.ToString() == substring) {
        unusedGuids.Remove(value);
        indices.Add(new Tuple<int, Guid>(i, value));
        break;
         }
    }
}

var builder = new StringBuilder();

int start = 0;
for (int i = 0; i < indices.Count; i++) {
    var tuple = indices[i];
    var substring = xml.Substring(start, tuple.Item1 - start);
    builder.Append(substring);
    builder.Append(oldToNewGuid[tuple.Item2].ToString());
    start = tuple.Item1 + guidLength;
}

builder.Append(xml.Substring(start, xml.Length - start));

xml = builder.ToString();
reveazure
  • 653
  • 2
  • 7
  • 14