
I can use a dictionary to find the frequency of each element in a string array. But when the number of elements in the array reaches around 50 million, the RAM usage of my program is around 8-9 GB. This is way higher than I expected: my dictionary is a Dictionary<string, int>, and 50 million key-value pairs (if there are no duplicate keys) should only cost around 2-2.5 GB. I wonder where I went wrong.

The part to get the frequency of elements:

public IEnumerable<string> GetTopTenStrings(string path)
{
    // Dictionary to store the result
    Dictionary<string, int> result = new Dictionary<string, int>();

    var txtFiles = Directory.EnumerateFiles(path, "*.dat");

    int i = 1;

    foreach (string currentFile in txtFiles)
    {
        using (FileStream fs = File.Open(currentFile, FileMode.Open,
            FileAccess.Read, FileShare.Read))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader buffer = new StreamReader(bs))
        {
            Console.WriteLine("Now we are at the file {0}", i);

            // Store processed lines                        
            string storage = buffer.ReadToEnd();

            Process(result, storage);  

            i++;
        }
    }

    // Sort the dictionary and return the needed values
    var final = result.OrderByDescending(x => x.Value).Take(10);

    foreach (var element in final)
    {
        Console.WriteLine("{0} appears {1}", element.Key, element.Value);
    }

    var test = final.Select(x => x.Key).ToList();

    Console.WriteLine('\n');

    return test;
}

The function to add key value to the dictionary:

public void Process(Dictionary<string, int> dict, string storage)
{
    List<string> lines = new List<string>();

    string[] line = storage.Split(";");

    foreach (var item in line.ToList())
    {
        if(item.Trim().Length != 0)
        {
            if (dict.ContainsKey(item.ToLower()))
            {
                dict[item.ToLower()]++;
            }
            else
            {
                dict.Add(item.ToLower(), 1);
            }
        }
    }
}
supnep
  • You realize that each of the `Split`, `Trim` and `ToLower` calls creates another string? So you would expect to have 5 copies of your strings floating around. Start by dropping `ToLower` in favor of a dictionary with a case-insensitive comparer (a short sketch follows these comments). – GSerg Jun 11 '23 at 15:41
  • I just removed the ToLower() and RAM usage dropped a bit. But about case sensitivity, could you explain more? The result I get after removing ToLower() is different from the old one. – supnep Jun 11 '23 at 15:59
  • If you start putting the words "C# dictionary case" into Google, it will immediately suggest "c# dictionary case insensitive string key". If you agree to that query, the very first result on the page will be https://stackoverflow.com/q/13988643/11683. – GSerg Jun 11 '23 at 16:02
  • Did you debug your code using, say, 2 input files with at most 5 lines (or so)? How many items were in the dictionary when you tried that? Did a slight variation in the input lead to a larger or smaller dictionary? (You did not provide data, so I cannot test myself...) – Luuk Jun 11 '23 at 16:30
  • @Luuk I did some tests. As long as I don't push values into the dictionary, the RAM usage is only 10-16 MB. With 1 file (49,990 lines, each line has 10 elements separated by ' ; ', each element has 10 characters), RAM usage is around 300-400 MB. And I have ten files like that. – supnep Jun 11 '23 at 16:43
  • The C# runtime will not try to reduce heap occupancy unless it considers the system to be under heavy memory load. If you want to see what the actual usage is, force a garbage collection (see the measurement sketch after these comments), like here: https://stackoverflow.com/questions/17020623/implicit-garbage-collection-doesnt-reduce-memory-footprint-gc-collect-does – pm100 Jun 11 '23 at 17:30
  • I stopped reading at "reaches around 50 million". Almost regardless of your specific conditions, having a collection with so many elements seems in itself highly improvable, especially if that collection is a dictionary, which is particularly inefficient. Personally, I rarely go above 5K for simple 1D arrays! But they are way more efficient and consume much less memory than dictionaries. There are many ways to avoid such a wasteful approach, starting with proper usage of the available resources. Memory is for quick access to relatively small data sets. If you need to go further, use a DB. – varocarbas Jun 12 '23 at 04:30
  • How many unique words are contained in all the files? Do you have a rough estimate of their number? The optimal solution to your problem might depend on the nature of the data. If the words are mostly unique (most words appear just one time), you'll end up with a huge dictionary. If the words are mostly duplicates (each word appears hundreds of times), the dictionary will stay relatively small and manageable. – Theodor Zoulias Jun 12 '23 at 06:59
  • Related: [Given a file, find the ten most frequently occurring words as efficiently as possible](https://stackoverflow.com/questions/4495241/given-a-file-find-the-ten-most-frequently-occurring-words-as-efficiently-as-pos) (algorithm, language agnostic). – Theodor Zoulias Jun 12 '23 at 07:19

1 Answer


There are significant performance and memory improvements possible here:

  • We can improve memory usage by holding the whole file in one big string, and using ReadOnlyMemory<char> to hold references to it.
  • Pre-initialize the dictionary to some large size, so that resizes are not necessary.
  • Instead of .Trim() use string.IsNullOrWhiteSpace or similar.
  • Instead of .ToLower() use a case-insensitive comparer on the dictionary.
    If you were using string keys you could use StringComparer, but we need a custom comparer for ReadOnlyMemory<char>.
  • Don't use .ToList() unnecessarily.
  • BufferedStream seems unnecessary.
  • Close the file before processing the string.
  • We can use CollectionsMarshal.GetValueRefOrAddDefault to avoid a double lookup on the dictionary.
  • Instead of using OrderByDescending(...).Take(...), which would require sorting the entire list, we can use an insertion sort into a fixed-size list.
    • We just binary-search each value's location, and then insert at the new location.
    • BinarySearch returns the bitwise complement (~) of the insertion index if the value isn't found.

This all assumes that the number of duplicate strings isn't very high. If the number of duplicates is high, it would make more sense to split each string and intern the results using a StringPool (a sketch of that alternative follows the code below).

public IEnumerable<string> GetTopTenStrings(string path)
{
    // Dictionary to store the result
    var result = new Dictionary<ReadOnlyMemory<char>, int>(someLargeCapacityHere, new MemoryComparer());

    int i = 1;

    foreach (string currentFile in Directory.EnumerateFiles(path, "*.dat"))
    {
        string str;
        using (FileStream fs = File.Open(currentFile, FileMode.Open,
            FileAccess.Read, FileShare.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            Console.WriteLine($"Now we are at the file {i}: {currentFile}");
            str = sr.ReadToEnd();

            i++;
        }
        Process(result, str);
    }

    // Sort the dictionary and return the needed values
    var final = TopByOrderDesc(result, 10);

    foreach (var element in final)
    {
        Console.WriteLine("{0} appears {1}", element.Key, element.Value);
    }

    var test = final.Select(x => x.Key.ToString()).ToList();

    Console.WriteLine('\n');

    return test;
}
public void Process(Dictionary<ReadOnlyMemory<char>, int> dict, string str)
{
    var startIndex = 0;
    while (true)
    {
        // search for separator
        var endIndex = str.IndexOf(';', startIndex);
        if (endIndex < 0)   // not found (IndexOf returns -1)
            endIndex = str.Length;   // go to the end of the string
        if (endIndex - startIndex > 0)    // if non-zero
        {
            var mem = str.AsMemory(startIndex, endIndex - startIndex);
            if (!MemoryExtensions.IsWhiteSpace(mem.Span))    // and not whitespace
            {
                ref var val = ref CollectionsMarshal.GetValueRefOrAddDefault(dict, mem, out _);  // get ref of KVP location in dictionary
                val++;    // increment location by 1
            }
        }
        if (endIndex == str.Length)    // finished string
            break;
        startIndex = endIndex + 1;    // otherwise move to next char
    }
}
public List<KeyValuePair<ReadOnlyMemory<char>, int>> TopByOrderDesc(Dictionary<ReadOnlyMemory<char>, int> source, int top)
{
    var list = new List<KeyValuePair<ReadOnlyMemory<char>, int>>(top + 1);  //pre-initialize
    var comparer = Comparer<KeyValuePair<ReadOnlyMemory<char>, int>>.Create(
        (kvp1, kvp2) => kvp2.Value.CompareTo(kvp1.Value)
    );    // !!! Reverse comparer !!!

    foreach (var item in source)
    {
        var index = list.BinarySearch(item, comparer);
        if (index < 0)
            index = ~index;

        if (index < top) // no point inserting last one
        {
            if (list.Count == top)
                list.RemoveAt(top - 1);

            list.Insert(index, item);
        }
    }
    return list;
}
class MemoryComparer : IEqualityComparer<ReadOnlyMemory<char>>
{
    public StringComparison Comparison {get; set;} = StringComparison.OrdinalIgnoreCase;

    public bool Equals(ReadOnlyMemory<char> a, ReadOnlyMemory<char> b) =>
        MemoryExtensions.Equals(b.Span, a.Span, Comparison);

    public int GetHashCode(ReadOnlyMemory<char> o) =>
        string.GetHashCode(o.Span, Comparison);
}
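
A minimal sketch of the StringPool alternative mentioned above, assuming the CommunityToolkit.HighPerformance NuGet package (which provides StringPool); the method name and loop are illustrative, not part of the answer's code:

// Requires: using CommunityToolkit.HighPerformance.Buffers;
// The dictionary can still be created with StringComparer.OrdinalIgnoreCase
// to keep the case-insensitive behaviour.
public void ProcessPooled(Dictionary<string, int> dict, string str)
{
    var startIndex = 0;
    while (startIndex <= str.Length)
    {
        var endIndex = str.IndexOf(';', startIndex);
        if (endIndex < 0)
            endIndex = str.Length;

        var span = str.AsSpan(startIndex, endIndex - startIndex);
        if (span.Length > 0 && !span.IsWhiteSpace())
        {
            // GetOrAdd returns a cached string for an equal span, so heavily
            // duplicated tokens all share a single string instance.
            string token = StringPool.Shared.GetOrAdd(span);
            dict.TryGetValue(token, out var count);
            dict[token] = count + 1;
        }

        startIndex = endIndex + 1;
    }
}
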
Charlieface
  • Nice answer! For other implementations of the `TopByOrderDesc` extension method you can look at this question: [Extract the k maximum elements of a list](https://stackoverflow.com/questions/15089373/extract-the-k-maximum-elements-of-a-list). – Theodor Zoulias Jun 12 '23 at 05:56
  • On second thought, holding the text of all files in memory might not be a good idea. In case the number of files is large, this is not going to scale well. When the physical RAM is exhausted the operating system will start using paged memory (the disk), resulting most likely in huge performance degradation. – Theodor Zoulias Jun 12 '23 at 06:44
  • I reckoned the number of duplicates wasn't very high. If the number of duplicates is high then it makes sense to split each string and `Intern` the results using a `StringPool`. – Charlieface Jun 12 '23 at 09:28
  • Do you mind explaining the logic in the Process function? It's giving the error "Specified argument was out of the range of valid values. (Parameter 'start')", since the first time through the loop index - lastIndex - 1 is 0 - 0 - 1, which returns a negative value. – supnep Jun 13 '23 at 16:51
  • OK, changed the algorithm a bit and put in comments. It's just an algorithm to split the string by iterating over the `;` using `IndexOf`. – Charlieface Jun 13 '23 at 19:55
  • Instead of splitting the string manually, you could consider using the lightweight [`Regex.EnumerateMatches`](https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.enumeratematches) method (a sketch follows these comments). – Theodor Zoulias Jun 13 '23 at 22:45
  • @TheodorZoulias From .NET 8: https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.split?view=net-8.0#system-memoryextensions-split(system-readonlyspan((system-char))-system-span((system-range))-system-char-system-stringsplitoptions) – Charlieface Jun 13 '23 at 23:11
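
A minimal sketch of that Regex.EnumerateMatches approach, which needs .NET 7+; the [^;\s]+ pattern assumes tokens contain no internal whitespace, so adjust it if they do:

// Requires: using System.Text.RegularExpressions; and using System.Runtime.InteropServices;
public void ProcessWithRegex(Dictionary<ReadOnlyMemory<char>, int> dict, string str)
{
    // ValueMatch only carries Index and Length, so no Match objects are allocated;
    // each token is sliced straight out of the source string.
    foreach (ValueMatch m in Regex.EnumerateMatches(str, @"[^;\s]+"))
    {
        var token = str.AsMemory(m.Index, m.Length);
        ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(dict, token, out _);
        count++;
    }
}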