
I have two huge dictionaries: DictHashesSource, with 2256001 entries, and DictHashesTarget, with 2061735 entries.

Dictionary<int, string> DictHashesSource = new Dictionary<int, string>();
Dictionary<int, string> DictHashesTarget = new Dictionary<int, string>();

For each element of DictHashesSource, I want to retrieve all matching elements in DictHashesTarget, and then do the same in the opposite direction. To do so, I used LINQ as shown below:

IEnumerable<string> interceptedRowsSource = DictHashesSource.Values.Where(x => DictHashesTarget.Values.Contains(x)).ToList();
IEnumerable<string> interceptedRowsTarget = DictHashesTarget.Values.Where(x => DictHashesSource.Values.Contains(x)).ToList();

The problem is that, because the two dictionaries are so big, each operation takes more than 1 hour. Is there any way to reduce the complexity of this algorithm?

Note: I really have to use two dictionaries because I will have to use the keys in further operations.

Another note: The same value doesn't have the same key in both dictionaries.

Pugnatore

2 Answers


One approach is to build a reverse dictionary, so that your values become keys and vice versa. Checking whether a value exists then takes constant time on average instead of a scan over all values.

        Dictionary<int, string> source = new Dictionary<int, string>();
        Dictionary<int, string> target = new Dictionary<int, string>();

        source.Add(1, "a");
        source.Add(2, "b");
        source.Add(3, "c");

        target.Add(4, "c");
        target.Add(5, "d");
        target.Add(6, "e");

        // Reverse index:
        var reverseSource = source.Reverse();
        var reverseTarget = target.Reverse();

        foreach (var sourceItem in reverseSource)
        {
            if (reverseTarget.ContainsKey(sourceItem.Key)){
                Console.WriteLine($"Source and Target contains {sourceItem.Key}");
            }
        }

This uses the following reverse-dictionary extension method (it has to be declared in a static class):

    public static Dictionary<TValue, TKey> Reverse<TKey, TValue>(this IDictionary<TKey, TValue> source)
    {
        var dictionary = new Dictionary<TValue, TKey>();
        foreach (var entry in source)
        {
            if (!dictionary.ContainsKey(entry.Value))
                dictionary.Add(entry.Value, entry.Key);
        }
        return dictionary;
    }

Note that this keeps only the first key for each duplicated value. If you need all of the keys, you could store them as a list (or a comma-separated string) instead.
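If every key is needed for a duplicated value, the reverse index could map each value to a list of keys. This is only a minimal sketch of that idea: `ReverseAll` and `ReverseIndexSketch` are hypothetical names that do not appear in the code above, and it assumes the values are non-null.

```csharp
using System.Collections.Generic;

public static class ReverseIndexSketch
{
    // Hypothetical helper: maps each value to every key that holds it,
    // instead of keeping only the first key as Reverse() does.
    public static Dictionary<TValue, List<TKey>> ReverseAll<TKey, TValue>(
        this IDictionary<TKey, TValue> source)
    {
        var index = new Dictionary<TValue, List<TKey>>();
        foreach (var entry in source)
        {
            // Create the key list the first time a value is seen.
            if (!index.TryGetValue(entry.Value, out var keys))
            {
                keys = new List<TKey>();
                index.Add(entry.Value, keys);
            }
            keys.Add(entry.Key);
        }
        return index;
    }
}
```

With this, `source.ReverseAll()["c"]` would hold every key in `source` whose value is `"c"`, so the original keys stay available for later operations.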

Kiksen

You could create HashSets with values from both dictionaries.

HashSet<string> HashesSourceSet = new HashSet<string>(DictHashesSource.Values);

HashSet<string> HashesTargetSet = new HashSet<string>(DictHashesTarget.Values);

Then do something like this:

var result1 = HashesSourceSet.Where(x => HashesTargetSet.Contains(x)).ToList();
var result2 = HashesTargetSet.Where(x => HashesSourceSet.Contains(x)).ToList();

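This reduces the complexity from O(n·m) to roughly O(n + m), because each HashSet lookup is O(1) on average instead of a linear scan over all values.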
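If the keys and the duplicated values have to survive, one possible variation is to build a HashSet only for the lookup side and filter the dictionary entries themselves. The following is a sketch under that assumption, not a tested solution for the full data set; `MatchSketch` and `EntriesMatching` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MatchSketch
{
    // Hypothetical helper: returns the entries of `dict` whose value
    // also occurs somewhere in `other`. Keys and duplicates are kept.
    public static List<KeyValuePair<int, string>> EntriesMatching(
        Dictionary<int, string> dict, Dictionary<int, string> other)
    {
        // Build the lookup set once; each Contains is O(1) on average.
        var otherValues = new HashSet<string>(other.Values);
        return dict.Where(entry => otherValues.Contains(entry.Value)).ToList();
    }
}
```

Calling `MatchSketch.EntriesMatching(DictHashesSource, DictHashesTarget)` and then the same call with the arguments swapped would give both directions, each in a single pass over one dictionary.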

----------------- UPDATE --------------------

As you mentioned that you need the count of each hash, you could do it as below:

Dictionary<string, int> HashesWithCount = new Dictionary<string, int>();

foreach (var x in DictHashesSource.Values)
{
    HashesWithCount[x] = HashesWithCount.ContainsKey(x) ? HashesWithCount[x] + 1 : 1;
}

Now you have the number of occurrences of each value, including the duplicated ones.
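The same counting could be applied to both dictionaries and the two results compared, which gives the shared values together with how often each one occurs on either side. This is a rough sketch only; `CountSketch`, `CountValues`, and `SharedCounts` are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;

public static class CountSketch
{
    // Hypothetical helper: counts how often each value occurs.
    public static Dictionary<string, int> CountValues(IEnumerable<string> values)
    {
        var counts = new Dictionary<string, int>();
        foreach (var value in values)
        {
            // TryGetValue leaves n at 0 when the value is not present yet.
            counts.TryGetValue(value, out var n);
            counts[value] = n + 1;
        }
        return counts;
    }

    // Hypothetical helper: values present on both sides, with both counts.
    public static Dictionary<string, (int Source, int Target)> SharedCounts(
        Dictionary<int, string> source, Dictionary<int, string> target)
    {
        var sourceCounts = CountValues(source.Values);
        var targetCounts = CountValues(target.Values);
        var shared = new Dictionary<string, (int Source, int Target)>();
        foreach (var pair in sourceCounts)
        {
            if (targetCounts.TryGetValue(pair.Key, out var inTarget))
                shared[pair.Key] = (pair.Value, inTarget);
        }
        return shared;
    }
}
```

Each dictionary is scanned once, so the work stays linear in the total number of entries.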

Yns
  • The problem is that I have some duplicated values, and as far as I know HashSet doesn't allow duplicated values, right? – Pugnatore Feb 07 '20 at 09:36
  • Yes, a HashSet keeps distinct values only. If you need their count, you could create a Dictionary with your hash values as keys and their counts as values. While adding hashes to the dictionary, if the key is already present, increment its value. – Yns Feb 07 '20 at 16:39