I have a class that stores values and detects whether that set of values is distinct.

using System.Collections.Generic;
using System.Linq;

public class TextRecords
{
    public TextRecords()
    {
        Count = 0;
        TextInstanceDictionary = new Dictionary<string, int>();
    }

    public int Count { get; set; }

    public Dictionary<string, int> TextInstanceDictionary { get; set; }

    public void AddOrUpdateTextInstanceDictionary(string theText)
    {
        if (!TextInstanceDictionary.ContainsKey(theText))
        {
            TextInstanceDictionary.Add(theText, 1);
        }
        else
        {
            TextInstanceDictionary[theText] += 1;
        }
    }

    public bool AllValuesAreDistinct
    {
        get
        {
            return !TextInstanceDictionary.Any(kv => kv.Value > 1);
        }
    }
}

This works fine for small sets of values but does not scale in terms of memory usage or performance.
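
(For what it's worth, the counts aren't strictly needed for the distinctness check. A set expresses the same idea, as in this minimal sketch with a hypothetical TextRecordsSet class, but it has exactly the same memory problem:)

using System.Collections.Generic;

public class TextRecordsSet
{
    private readonly HashSet<string> _seen = new HashSet<string>();

    // HashSet<T>.Add returns false when the value has already been added,
    // so a single flag is enough to track distinctness.
    public bool AllValuesAreDistinct { get; private set; } = true;

    public void Add(string theText)
    {
        if (!_seen.Add(theText))
        {
            AllValuesAreDistinct = false;
        }
    }
}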

Is there a way to detect whether a set of values is unique without storing them all in memory, as I am doing in the approach above?

I am looking for a reasonably small memory footprint while retaining a good level of speed.

I am aware of Bloom filters and have read this answer. Are there any other methods to solve this very specific problem?

(Note: I have also reviewed this answer, but that is a different problem. I am streaming in the values one by one, so I just need to know whether I have seen each value before. The other answer deals with the case where you are presented with a fully populated collection and asked whether its values are distinct.)
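
For concreteness, this is roughly the Bloom-filter idea I am referring to. It is only a minimal sketch: the BloomFilter name, bit-array size, and hash count are placeholders, and because of false positives it can occasionally report a genuinely new value as already seen.

using System.Collections;

public class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount = 1 << 24, int hashCount = 4)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Returns true if the value was definitely not seen before;
    // a false return may be a false positive.
    public bool AddAndCheckIsNew(string value)
    {
        bool isNew = false;

        // Derive the probe positions from two hash values (double hashing).
        int h1 = value.GetHashCode();
        int h2 = unchecked(h1 * 0x5bd1e995 + 0x1b873593);

        for (int i = 0; i < _hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            int index = (int)((uint)combined % (uint)_bits.Length);

            if (!_bits[index])
            {
                isNew = true;      // at least one bit was unset => definitely new
                _bits[index] = true;
            }
        }

        return isNew;
    }
}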

Mark Robinson
  • This is the element distinctness problem, a well studied problem with known theoretical boundaries. Thread marked as dupe discusses the theoretical boundaries and several solutions to this problem. – amit Jun 19 '20 at 06:23
  • "Is there a way to detect if a set of values are unique without storing them all in memory as I am doing in the approach above?" How are you storing them right now? Is it a text document on disk or what? – MindSwipe Jun 19 '20 at 06:29
  • You _could_ populate 2 buffer lists/dicts/whatever (sorted!), match the values one by one, and "remove" them from both when the same value has occurred. That could save you lots of memory, if the values are _approximately_ streaming in the same order. If they are not, or are in reverse order, you will end up saving maybe half the memory only, and with a high computation cost, though. – Pac0 Jun 19 '20 at 06:38
  • Your values are strings. Do these strings have particular characteristics which would allow you to compress them? – Damien Jun 19 '20 at 08:27
  • No @Damien. I'm afraid I know nothing about the input source. My program tries to analyse an anonymous data source. – Mark Robinson Jun 19 '20 at 12:07
  • If the values are large, you could store a hash of the value instead of the actual value, to save memory. But you need to store something somewhere (disk could be used to minimize memory use at the cost of orders of magnitude of speed) to understand whether you've seen {thing} before. – TristanK Jun 19 '20 at 13:56
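
A minimal sketch of the hashing idea from the last comment above (the HashedDistinctChecker name is a placeholder): store a 64-bit hash of each value instead of the value itself, so memory per entry is roughly 8 bytes plus set overhead regardless of string length. A hash collision would wrongly report a duplicate, although with 64-bit hashes that is very unlikely.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public class HashedDistinctChecker
{
    private readonly HashSet<ulong> _seenHashes = new HashSet<ulong>();

    public bool AllValuesAreDistinct { get; private set; } = true;

    public void Add(string theText)
    {
        // Hash the string and keep only the first 8 bytes of the digest.
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(theText));
            ulong hash64 = BitConverter.ToUInt64(digest, 0);

            if (!_seenHashes.Add(hash64))
            {
                AllValuesAreDistinct = false; // duplicate, or (rarely) a collision
            }
        }
    }
}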

0 Answers