0

I've a common problem where I've not found a proper solution. I've multiple XML strings with a specific tag (e.g. MIME_SOURCE) and I don't know which XML string contains which value. But I have to replace all occurrences.

On the other hand I have a dictionary containing all possible values of the XML as a key and the value to replace with as value. As I said, I don't know what to replace in which XML.

E.g.

Part of first XML

    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
    </MIME>

Part of second XML:

    <MIME>
        <MIME_SOURCE>\Web\Bilder klein\5478.jpg</MIME_SOURCE>
    </MIME>

Dictionary looks like:

{"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"}

{"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"}

{"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"}

{"\Web\Bilder klein\5478.jpg", "5478.jpg"}

My main problem is, if I iterate through the dictionary for each XML string the effort will be count of XML strings multiplied with count of entries in the dictionary (n*m). This is really bad in my case as there can be around a million XML strings and at least thousands of entries in the dictionary.

Currently I'm using string.Replace for each key of the dictionary for each XML.

Do you have a good idea how to speed up this process?


Edit:

I've changed code to the following one:

        var regex = new Regex(@"<MIME_SOURCE>[\s\S]*?<\/MIME_SOURCE>");

        foreach (Match match in regex.Matches(stringForXml))
        {
            DoReplacements...
        }

This fits to the requirements for now as the replacement will only be done for each MIME_SOURCE in the XML. But I will as well have a look at the mentioned algorithm.

Christoph
  • 155
  • 1
  • 2
  • 10
  • 1
    I had a similar problem a while back, look at this question, might help you out: https://stackoverflow.com/questions/46339057/c-sharp-fastest-string-search-in-all-files – Century Feb 11 '20 at 12:28
  • Yeah, thanks! This seems to be a really similar problem. I will check the algorithm. – Christoph Feb 11 '20 at 12:38

2 Answers2

2

The most correct way is to properly parse your XML. Then you can go through it in a single pass:

var xml = @"<root>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
    </MIME>
</root>";

var replacements = new Dictionary<string, string>()
{
    {@"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"},
    {@"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"},
    {@"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"},
    {@"\Web\Bilder klein\5478.jpg", "5478.jpg"}
};

var doc = XDocument.Parse(xml);
foreach (var source in doc.Root.Descendants("MIME_SOURCE"))
{
    if (replacements.TryGetValue(source.Value, out var replacement))
    {
        source.Value = replacement;
    }
}

var result = doc.ToString();

If you can make some assumptions about how your XML is structured (e.g. no whitespace between the <MINE_SOURCE> tags, no attributes, etc), then you can use some regex, allowing you to again make a single pass:

var result = Regex.Replace(xml, @"<MIME_SOURCE>([^<]+)</MIME_SOURCE>", match =>
{
    if (replacements.TryGetValue(match.Groups[1].Value, out var replacement))
    {
        return $"<MIME_SOURCE>{replacement}</MIME_SOURCE>";
    }
    return match.Value;
});

You'll have to benchmark different approaches yourself on your own data. Use BenchmarkDotNet.

canton7
  • 37,633
  • 3
  • 64
  • 77
  • Thanks for your suggest :) The XML is wellformed and therefor without any whitespace in the tag or something like that. Unfortunately using XDoc is restricted in my case because of to much memory using... Otherwise I would like to use this approach. – Christoph Feb 11 '20 at 12:43
  • @Christoph How big is your XML file? – canton7 Feb 11 '20 at 12:43
1

As I already mentioned in a comment above, I used to have a similar problem (see: c# Fastest string search in all files).

Using the Aho–Corasick algorithm that has been suggested to me in the accepted answer I was able to conduct a string search in fast enough time for my problem (going from a minutes execution time to merely seconds).

An implementation of said algorithm can be found here.

Here is a little sample on how to use the implementation linked above. (looking some needles in a haystack)

static bool anyViaAhoCorasick(string[] needles, string haystack)
{
    var trie = new Trie();
    trie.Add(needles);
    trie.Build();
    return trie.Find(haystack).Any();
}
Century
  • 285
  • 6
  • 20