Check if HUGE dictionary contains a string and get all elements that match element

Question

I have two huge dictionaries, one named DictHashesSource with 2256001 lines and another dictionary named DictHashesTarget with 2061735 lines.

Dictionary<int, string> DictHashesSource = new Dictionary<int, string>();
Dictionary<int, string> DictHashesTarget = new Dictionary<int, string>();

What I want to do is, for each element of DictHashesSource, retrieve all elements in DictHashesTarget that match, and then do the exact same thing in the opposite direction. To do so, I used LINQ as below:

IEnumerable<string> interceptedRowsSource = DictHashesSource.Values.Where(x => DictHashesTarget.Values.Contains(x)).ToList();
IEnumerable<string> interceptedRowsTarget = DictHashesTarget.Values.Where(x => DictHashesSource.Values.Contains(x)).ToList();

The problem is that, because the two dictionaries are really big, each operation takes more than 1 hour. Is there any way to reduce the complexity of this algorithm?

Note: I really have to use two dictionaries because I will have to use the keys in further operations.

Another note: The same value doesn't have the same key in both dictionaries.

More info please. What is DictHashesSource defined as? What is DictHashesTarget defined as? Do you need to materialize it (.ToList()) before other operations? – ProgrammingLlama, Feb 06 '20 at 12:00
Every time you call `Values` it's O(1) time complexity, so each statement you have with the `Contains` is O(n^2 * 2) – TheGeneral, Feb 06 '20 at 12:01
`Dictionary.Values` is effectively just a linear list and not even a simple one at that, since it has to walk the buckets. Create new collections with the values (like `HashSet`) and search in those. (`Enumerable.Intersect`, `.Union` and `.Except` use `Set` in the background.) – Jeroen Mostert, Feb 06 '20 at 12:07
dict1.Values.Intersect(dict2.Values); doesn't work because it doesn't retrieve the duplicates – Pugnatore, Feb 06 '20 at 12:07
Dictionary will not work for this, you need to have a tree type search like this: https://github.com/gmamaladze/trienet – Gabriel Alexandre, Feb 06 '20 at 12:08
For duplicates, you can reduce the values to tuples with a value and an occurrence count, or use `Enumerable.ToLookup`. – Jeroen Mostert, Feb 06 '20 at 12:09
@Pugnatore Check [this `Overlap()` extension method](https://stackoverflow.com/a/5012081/8967612). You can do `dict1.Values.Overlap(dict2.Values).ToList();`. Should be pretty quick. – 41686d6564 stands w. Palestine, Feb 06 '20 at 12:17
Please provide samples/explain what the `int` and `string` values are. Seems like this may be an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – NetMage, Feb 06 '20 at 18:41

Kiksen · Answer 1 · 2020-02-06T12:28:17.747

0

One approach is to build a reverse dictionary, so that your values become keys and vice versa. Looking up a value then takes roughly constant time instead of a scan over all values.

        Dictionary<int, string> source = new Dictionary<int, string>();
        Dictionary<int, string> target = new Dictionary<int, string>();

        source.Add(1, "a");
        source.Add(2, "b");
        source.Add(3, "c");

        target.Add(4, "c");
        target.Add(5, "d");
        target.Add(6, "e");

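        // Reverse index: values become keys (uses the Reverse extension method shown below).
        var reverseSource = source.Reverse();
        var reverseTarget = target.Reverse();

        // One pass over the source values; each ContainsKey is a hash lookup.
        foreach (var sourceItem in reverseSource)
        {
            if (reverseTarget.ContainsKey(sourceItem.Key))
            {
                Console.WriteLine($"Source and Target contain {sourceItem.Key}");
            }
        }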

This uses the following reverse-dictionary extension method (as an extension method, it has to be declared inside a static class):

    // Builds a value -> key dictionary. When a value occurs more than once,
    // only the first key encountered is kept.
    public static Dictionary<TValue, TKey> Reverse<TKey, TValue>(this IDictionary<TKey, TValue> source)
    {
        var dictionary = new Dictionary<TValue, TKey>();
        foreach (var entry in source)
        {
            if (!dictionary.ContainsKey(entry.Value))
                dictionary.Add(entry.Value, entry.Key);
        }
        return dictionary;
    }

If you need every key for a duplicated value, you could store all of that value's keys, for example as a comma-separated list.
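
To keep every key instead of only the first one, the reverse index could map each value to a list of keys. This is a minimal sketch, not part of the original answer: the class `ReverseIndexSketch`, the helper `ToReverseIndex` and the sample data are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ReverseIndexSketch
{
    // Hypothetical helper: maps each value to ALL keys that hold it.
    public static Dictionary<TValue, List<TKey>> ToReverseIndex<TKey, TValue>(
        IDictionary<TKey, TValue> source)
    {
        var index = new Dictionary<TValue, List<TKey>>();
        foreach (var entry in source)
        {
            // Create the key list the first time a value is seen.
            if (!index.TryGetValue(entry.Value, out var keys))
            {
                keys = new List<TKey>();
                index.Add(entry.Value, keys);
            }
            keys.Add(entry.Key);
        }
        return index;
    }

    public static void Main()
    {
        // Illustrative data; "c" is duplicated in source.
        var source = new Dictionary<int, string> { [1] = "a", [2] = "c", [3] = "c" };
        var target = new Dictionary<int, string> { [4] = "c", [5] = "d" };

        var targetIndex = ToReverseIndex(target);

        // One pass over source; each lookup is a hash lookup, not a scan.
        foreach (var entry in source)
        {
            if (targetIndex.TryGetValue(entry.Value, out var targetKeys))
            {
                Console.WriteLine(
                    $"Source key {entry.Key} ('{entry.Value}') matches target keys: {string.Join(", ", targetKeys)}");
            }
        }
    }
}
```

Building the index is one pass over the target dictionary and the comparison is one pass over the source, so the total work grows with the sum of the two sizes rather than their product.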

edited Feb 06 '20 at 12:28

answered Feb 06 '20 at 12:08


yes it makes sense, but it is still taking more than one hour to do this comparison – Pugnatore Feb 06 '20 at 12:10
I will add code snippet to help :) You still need to make 1 run through. For that data it shouldn't take a long time. 2sek – Kiksen Feb 06 '20 at 12:13
This is quadratic. There's no need for it to be quadratic: you can make it linear. – canton7 Feb 06 '20 at 12:21
Redid the entire approach to use reverse dictionaries instead. – Kiksen Feb 06 '20 at 12:28
how does this make it better? – Jonathan Alfaro Feb 06 '20 at 14:13

Yns · Answer 2 · 2020-02-07T17:07:05.887

0

You could create HashSets with values from both dictionaries.

HashSet<string> HashesSourceSet = new HashSet<string>(DictHashesSource.Values);

HashSet<string> HashesTargetSet = new HashSet<string>(DictHashesTarget.Values);

Then do something like this:

var result1 = HashesSourceSet.Where(x => HashesTargetSet.Contains(x)).ToList();
var result2 = HashesTargetSet.Where(x => HashesSourceSet.Contains(x)).ToList();

This reduces the complexity from O(n * m) to roughly O(n + m), because each HashSet lookup takes constant time on average.
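
Since a HashSet keeps only distinct values, filtering the sets themselves drops duplicates and loses the keys. One possible variation, sketched here with made-up names (`HashSetLookupSketch`, `MatchingEntries`) and made-up sample data, is to use the set only for lookups and filter the dictionary entries instead:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashSetLookupSketch
{
    // Hypothetical helper: returns every entry of `from` (keys and duplicate
    // values included) whose value also occurs somewhere in `other`.
    public static List<KeyValuePair<int, string>> MatchingEntries(
        Dictionary<int, string> from, Dictionary<int, string> other)
    {
        // Built once, in a single pass over the other dictionary's values.
        var otherValues = new HashSet<string>(other.Values);

        // Single pass over `from`; each Contains is a hash lookup.
        return from.Where(entry => otherValues.Contains(entry.Value)).ToList();
    }

    public static void Main()
    {
        // Illustrative data; "c" is duplicated in source.
        var source = new Dictionary<int, string> { [1] = "a", [2] = "c", [3] = "c" };
        var target = new Dictionary<int, string> { [4] = "c", [5] = "d" };

        var matchedSource = MatchingEntries(source, target);
        var matchedTarget = MatchingEntries(target, source);

        foreach (var entry in matchedSource)
            Console.WriteLine($"source[{entry.Key}] = {entry.Value}");
        foreach (var entry in matchedTarget)
            Console.WriteLine($"target[{entry.Key}] = {entry.Value}");
    }
}
```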

----------------- UPDATE --------------------

Since you mentioned that you need the count of each hash, you could do the following:


Dictionary<string, int> HashesWithCount = new Dictionary<string, int>();

foreach (var x in DictHashesSource.Values)
{
    // count stays 0 when x has not been seen yet
    HashesWithCount.TryGetValue(x, out int count);
    HashesWithCount[x] = count + 1;
}

Now you have the number of occurrences of each value.
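
Those counts can then be consulted while scanning the other dictionary. A small sketch of that idea follows. The names `HashCountSketch` and `CountValues` and the sample data are assumptions made for illustration and do not come from the question.

```csharp
using System;
using System.Collections.Generic;

public static class HashCountSketch
{
    // Hypothetical helper: counts how many times each value occurs.
    public static Dictionary<string, int> CountValues(IEnumerable<string> values)
    {
        var counts = new Dictionary<string, int>();
        foreach (var value in values)
        {
            counts.TryGetValue(value, out int count); // count stays 0 for a new value
            counts[value] = count + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Illustrative data; "c" is duplicated in source.
        var source = new Dictionary<int, string> { [1] = "a", [2] = "c", [3] = "c" };
        var target = new Dictionary<int, string> { [4] = "c", [5] = "d" };

        var sourceCounts = CountValues(source.Values);

        // For each target entry, report how often its value occurs in source.
        foreach (var entry in target)
        {
            if (sourceCounts.TryGetValue(entry.Value, out int occurrences))
            {
                Console.WriteLine(
                    $"Target key {entry.Key} ('{entry.Value}') occurs {occurrences} time(s) in source");
            }
        }
    }
}
```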

edited Feb 07 '20 at 17:07

answered Feb 07 '20 at 07:50


The problem is that I have some duplicated values, and as far as I know HashSet doesn't allow duplicated values, right? – Pugnatore Feb 07 '20 at 09:36
Yes, a HashSet will keep distinct values only. If you need their count, you could create a Dictionary with your hash values and their counts. While adding hashes to the dictionary, if the key is already present, increment the value. – Yns Feb 07 '20 at 16:39
