3

I am struggling with this problem and would really appreciate any help. I'm working on an existing project, to which I have added logic that counts combinations of values and makes sure we do not pass some limit. For example, given a data table with these columns:
Name|Age|description
the code makes sure we do not have more than K combinations of (Name, Age). My data contains around a million such pairs. At some point the program just crashes or gets stuck, though I do not see any memory or CPU issue. I implemented this limit using a ConcurrentDictionary with (Name, Age) tuples as keys, on C# .NET 6.
I can see that the time it takes to try to add an element to the data structure becomes really long at some point.

Edit: adding some code pieces. There is a lot of inner implementation, but I believe these are the main parts needed to understand the problem:

Here is the component that's responsible for limiting the keys:

    protected override Result Process(Row row)
    {
        var valueToLimit = GetValueToLimit(row);
        var result = _values.TryAdd(valueToLimit);
        // some logic related to the case of crossing the limit
        return Result.Success;
    }

    protected abstract T GetValueToLimit(Row row);

The function GetValueToLimit is implemented for my case as:

    protected override string[] GetValueToLimit(Row row)
    {
        // takes the relevant values from an input record, according to the requested columns
        return _columnIndices.Select(x => row.GetValue(x)).ToArray();
    }

and finally, here are some parts of the concurrent hash set implementation:

    public class BoundedConcurrentHashSet<K> : ConcurrentHashSet<K>
    {
        ..
        public override Result TryAdd(K element)
        {
            if (Dictionary.Count() < _maxCapacity)
            {
                return base.TryAdd(element);
            }
            else
            {
                return Contains(element) ? Result.AlreadyInHash : Result.ExceedsCapacity;
            }
        }
    }

where ConcurrentHashSet is implemented with the C# ConcurrentDictionary:

    public class ConcurrentHashSet<K>
    {
        public ConcurrentHashSet(IEqualityComparer<K> equalityComparer)
        {
            Dictionary = new ConcurrentDictionary<K, object>(equalityComparer);
        }

        protected ConcurrentDictionary<K, object> Dictionary { get; }

        public int Count => Dictionary.Count;

        public IEnumerable<K> Elements => Dictionary.Keys;

        public virtual Result TryAdd(K element)
        {
            return Dictionary.TryAdd(element, null) ? Result.Added : Result.AlreadyInHash;
        }

        public bool Contains(K element)
        {
            return Dictionary.ContainsKey(element);
        }
    }

Please share any ideas that could help.

Thanks

Nika
  • What is the value of this dictionary? Can you share some of the code where you interact with this dictionary? – jalepi Apr 06 '22 at 19:47
  • @jalepi the value is tuples of strings. I added the code – Nika Apr 10 '22 at 19:14
  • Somewhat related: [How to improve performance of ConcurrentDictionary.Count in C#](https://stackoverflow.com/questions/41298156/how-to-improve-performance-of-concurrentdictionary-count-in-c-sharp) – Theodor Zoulias Apr 10 '22 at 19:31
  • This question is also related: [Thread safe Collection with upper bound](https://stackoverflow.com/questions/27403530/thread-safe-collection-with-upper-bound) – Theodor Zoulias Apr 11 '22 at 14:12

2 Answers

3

Here is your problem:

public override Result TryAdd(K element)
{
    if (Dictionary.Count() < _maxCapacity)
    {
        return base.TryAdd(element);
    }
    //...

...where Dictionary is the underlying ConcurrentDictionary<K, object> object.

The Count() is a LINQ method that either enumerates the sequence from start to end, or returns the Count property in case the sequence implements the ICollection<TSource> interface. The ConcurrentDictionary<K, V> implements this interface, so the Count property is indeed used. Here is what the documentation of this property says:

This property has snapshot semantics and represents the number of items in the ConcurrentDictionary<TKey,TValue> at the moment when the property was accessed.

The "snapshot semantics" is the important part. It means that in order to acquire the Count, the dictionary has to be completely locked temporarily. When a thread reads the Count, all other threads have to wait. No concurrency at all.

An ApproximateCount property was proposed at some point on GitHub, but it didn't get enough traction and the proposal is now closed. That property would allow you to implement the BoundedConcurrentHashSet functionality with much less overhead, but also with less accurate behavior: it would be possible to exceed the _maxCapacity configuration.
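One way to approximate that behavior today (a sketch, not the proposed API; `ApproxBoundedSet` is a hypothetical name) is to maintain a separate `Interlocked` counter next to the dictionary, so the hot path never calls `Count`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch: track the size with an Interlocked counter instead of
// ConcurrentDictionary.Count. TryAdd stays lock-free, at the cost of
// occasionally admitting slightly more than maxCapacity under heavy
// contention (the capacity check and the add are not atomic together).
public sealed class ApproxBoundedSet<T> where T : notnull
{
    private readonly ConcurrentDictionary<T, byte> _dict = new();
    private readonly int _maxCapacity;
    private int _count; // updated with Interlocked, read without locks

    public ApproxBoundedSet(int maxCapacity) => _maxCapacity = maxCapacity;

    // Returns false both when the element is already present
    // and when the (approximate) capacity has been reached.
    public bool TryAdd(T element)
    {
        if (Volatile.Read(ref _count) >= _maxCapacity)
            return false;

        if (_dict.TryAdd(element, 0))
        {
            Interlocked.Increment(ref _count);
            return true;
        }
        return false; // already present
    }

    public int Count => Volatile.Read(ref _count);
}
```

The trade-off is exactly the one described above: under contention a few extra elements can slip past the limit, in exchange for keeping the add path lock-free.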

My suggestion is to ditch the ConcurrentDictionary<K, object>, and use a HashSet<T> as underlying storage, protected with a lock.
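A minimal sketch of that suggestion (the `BoundedHashSet` and `AddResult` names are illustrative, not from the question's codebase):

```csharp
using System.Collections.Generic;

public enum AddResult { Added, AlreadyInHash, ExceedsCapacity }

// Bounded set: a plain HashSet<T> guarded by a single lock.
// Add, Contains and Count are all O(1) on HashSet<T>, so the
// critical section stays short even with millions of elements.
public sealed class BoundedHashSet<T>
{
    private readonly HashSet<T> _set;
    private readonly int _maxCapacity;
    private readonly object _lock = new();

    public BoundedHashSet(int maxCapacity, IEqualityComparer<T> comparer = null)
    {
        _maxCapacity = maxCapacity;
        _set = new HashSet<T>(comparer);
    }

    public AddResult TryAdd(T element)
    {
        lock (_lock)
        {
            if (_set.Count < _maxCapacity)
                return _set.Add(element) ? AddResult.Added : AddResult.AlreadyInHash;

            return _set.Contains(element)
                ? AddResult.AlreadyInHash
                : AddResult.ExceedsCapacity;
        }
    }

    public int Count { get { lock (_lock) return _set.Count; } }
}
```

Unlike the ConcurrentDictionary version, checking the count here is a cheap field read under the same lock, so the capacity check costs essentially nothing regardless of size.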

Theodor Zoulias
  • Thank you so much for the help! This sounds reasonable, but I do have some questions: 1. Wouldn't a lock leave us with the same issue? 2. I see that computation time goes up as the data structure gets bigger. How does that square with .Count() being the issue? I would expect it to be independent of the dictionary size. – Nika Apr 10 '22 at 19:42
  • @NiaB you might want to study the [source code](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Collections.Concurrent/src/System/Collections/Concurrent/ConcurrentDictionary.cs) of the `ConcurrentDictionary` class. In short, the larger the dictionary becomes, the more locks it uses internally, and the more expensive the `Count` becomes. This class is designed to be fast and not contentious when adding, reading and removing specific keys. Other operations are supported for completeness, and they are not guaranteed to be fast. – Theodor Zoulias Apr 10 '22 at 19:55
  • @NiaB the `HashSet` is very fast at adding, reading and removing specific elements, and also at reporting the `Count`. So if you `lock` around these operations only, it's not very likely that you'll notice much contention, unless you perform these operations at a scale of 100,000 times per second or more. – Theodor Zoulias Apr 10 '22 at 20:00
  • @NiaB in retrospect the suitability of using a `HashSet` in a multithreaded environment depends on the cost of calculating the hashcode of the `T`. If the cost is small (for example `T` is `int`) then the suitability increases. If the cost is high (for example `T` is `string` with very long strings) then the `HashSet` might become a bottleneck. – Theodor Zoulias Apr 11 '22 at 07:37
  • The HashSet keys will be (string, string), where each string's length is limited to ~16 characters. I see a big performance improvement when moving from an array of two strings with my own comparer to a tuple (string, string). Do you have an idea why that is? – Nika Apr 11 '22 at 10:46
1

I have found that using normal collections, with a lock around the places where you iterate and add, is a lot faster than using the concurrent collections. This becomes more pronounced the more items you add to your collection.
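A sketch of the pattern this answer describes (the names are illustrative): keep each add inside a short critical section, and when iterating, copy a snapshot under the lock so other threads are not blocked for the whole loop:

```csharp
using System.Collections.Generic;
using System.Linq;

var items = new HashSet<string>();
var gate = new object();

// Adds are short critical sections.
lock (gate)
{
    items.Add("alice|30");
    items.Add("bob|41");
}

// Copy a snapshot under the lock, then iterate outside it,
// so a long loop does not block writers.
string[] snapshot;
lock (gate)
{
    snapshot = items.ToArray();
}

foreach (var item in snapshot)
{
    // process item outside the lock
}
```

The snapshot copy costs O(n) memory and time, so it is only worth it when iteration is slow or rare compared to adds.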

Charles Owen
  • Enumerating non-thread-safe collections with a `lock` can be tricky, because enumerating a large collection can take a fair amount of time, and all other threads that want to interact with the collection during this time will be blocked. Hopefully you won't need to enumerate it very often! – Theodor Zoulias Apr 18 '22 at 18:13