0

I have a List<Keyword> where Keyword class is:

public string keyword;
public List<int> ids;
public int hidden;
public int live;
public bool worked;

Keyword has its own keyword, a set of 20 ids, live by default is set to 1 and hidden to 0.

I just need to iterate over the whole main List to invalidate those keywords whose number of same ids is greater than 6, so comparing every pair, if the second one has more than 6 ids repeated respect to the first one, hidden is set to 1 and live to 0.

The algorithm is very basic but it takes too long when the main list has many elements.

I'm trying to guess if there could be any method I could use to increase the speed.

The basic algorithm I use is:

foreach (Keyword main_keyword in lista_de_keywords_live)
{
    if (main_keyword.worked) {
        continue;
    }
    foreach (Keyword keyword_to_compare in lista_de_keywords_live)
    {
        if (keyword_to_compare.worked || keyword_to_compare.id == main_keyword.id) continue;

        n_ids_same = 0;
        foreach (int id in main_keyword.ids)
        {
            if (keyword_to_compare._lista_models.IndexOf(id) >= 0)
            {
                if (++n_ids_same >= 6) break;
            }
        }

        if (n_ids_same >= 6)
        {
            keyword_to_compare.hidden = 1;
            keyword_to_compare.live   = 0;
            keyword_to_compare.worked = true;
        }
    }
}
Komal12
  • 3,340
  • 4
  • 16
  • 25
Apalabrados
  • 1,098
  • 8
  • 21
  • 38
  • Tried using a [set](https://msdn.microsoft.com/en-us/library/bb359438(v=vs.110).aspx) instead? They're perfect for this sort of processing. If you don't want to change the container type in your class but you've got memory to spare, it might still be more efficient to just bung all the elements of the lists into sets and do the algorithm with those. – hnefatl Oct 22 '17 at 23:16
  • I've never used sets. Could please show me an example? – Apalabrados Oct 22 '17 at 23:19
  • Look at the "Remarks" section right at the bottom of the linked page - there's an example along with a description of some features that are relevant. – hnefatl Oct 22 '17 at 23:34
  • 1
    I think n_ids_same can never be >= 8. – Antonín Lejsek Oct 22 '17 at 23:34
  • @AntonínLejsek It's just a mistake when writing the query. In my code it's 6. – Apalabrados Oct 22 '17 at 23:39

2 Answers2

2

The code below is an example of how you would use a HashSet for your problem. However, I would not recommend using it in this scenario. On the other hand, the idea of sorting the ids to make the comparison faster still. Run it in a Console Project to try it out.

Notice that once I'm done adding new ids to a keyword, I sort them. This makes the comparison faster later on.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace KeywordExample
{

    public class Keyword
    {
        public List<int> ids;
        public int hidden;
        public int live;
        public bool worked;

        public Keyword()
        {
            ids = new List<int>();
            hidden = 0;
            live = 1;
            worked = false;
        }

        public override string ToString()
        {
            StringBuilder s = new StringBuilder();
            if (ids.Count > 0)
            {
                s.Append(ids[0]);
                for (int i = 1; i < ids.Count; i++)
                {
                    s.Append(',' + ids[i].ToString());
                }
            }
            return s.ToString();
        }

    }

    public class KeywordComparer : EqualityComparer<Keyword>
    {
        public override bool Equals(Keyword k1, Keyword k2)
        {
            int equals = 0;
            int i = 0;
            int j = 0;

            //based on sorted ids
            while (i < k1.ids.Count && j < k2.ids.Count)
            {
                if (k1.ids[i] < k2.ids[j])
                {
                    i++;
                }
                else if (k1.ids[i] > k2.ids[j])
                {
                    j++;
                }
                else
                {
                    equals++;
                    i++;
                    j++;
                }
            }

            return equals >= 6;
        }
        public override int GetHashCode(Keyword keyword)
        {
            return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
        }
    }


    class Program
    {

        static void Main(string[] args)
        {
            List<Keyword> listOfKeywordsLive = new List<Keyword>();
            //add some values
            Random random = new Random();
            int n = 10;
            int sizeOfMaxId = 20;
            for (int i = 0; i < n; i++)
            {
                var newKeyword = new Keyword();
                for (int j = 0; j < 20; j++)
                {
                    newKeyword.ids.Add(random.Next(sizeOfMaxId) + 1);
                }
                newKeyword.ids.Sort(); //sorting the ids
                listOfKeywordsLive.Add(newKeyword);
            }

            //solution here
            HashSet<Keyword> set = new HashSet<Keyword>(new KeywordComparer());
            set.Add(listOfKeywordsLive[0]);
            for (int i = 1; i < listOfKeywordsLive.Count; i++)
            {
                Keyword keywordToCompare = listOfKeywordsLive[i];
                if (!set.Add(keywordToCompare))
                {
                    keywordToCompare.hidden = 1;
                    keywordToCompare.live = 0;
                    keywordToCompare.worked = true;
                }
            }

            //print all keywords to check
            Console.WriteLine(set.Count + "/" + n + " inserted");
            foreach (var keyword in set)
            {
                Console.WriteLine(keyword);
            }

        }

    }
}
Raudel Ravelo
  • 648
  • 2
  • 6
  • 24
  • If two things are equal (Equals(...) == true) then they must return the same value for GetHashCode(). base.GetHashCode() is not correct here. – Antonín Lejsek Oct 23 '17 at 00:46
  • 1
    @AntonínLejsek, Notice that I'm using base.GetHashCode() which gives you the hashCode of the Keyword Comparer Class. Generally, I would not do that, but in this case I'm doing it to make all hashes equals by default. The cons is that the HashSet behavior turns O(n^2) for n insertions when doing this. Same Time Complexity though. – Raudel Ravelo Oct 23 '17 at 00:53
  • @AntonínLejsek, this is just an example for him (@Apalabrados) to know how to use a HashSet. I would use your solution ;) – Raudel Ravelo Oct 23 '17 at 00:56
  • 1
    Now I see. As long as you work with the same instance of the EqualityComparer you would get the same value for the hash code. In that case it is correct here, but I think just `return 0` would be much easier and more explicit. – Antonín Lejsek Oct 23 '17 at 01:26
  • You are right! I just edited the code and voted your answer up too. – Raudel Ravelo Oct 23 '17 at 01:30
  • Returning a constant value from `GetHashCode` is **never** a good idea. If you cannot come up with a good value to return based on the item you are comparing in your equality comparer, then that’s a clear sign that you shouldn’t be doing this in the first place. In this case, the result of the hash set depends on the *order* of how the items are inserted. This makes this a **very bad fit** for hash sets, not to mention that this solution results in a terrible performance. Please do not suggest people to use something like this! – poke Nov 19 '17 at 15:14
  • 2
    Even aside from the hash code, this is a bad equality comparer because it fails the transitivity test. Consider three Keyword objects, A, B and C. If A has 6 ids, C has 6 different ones, and B has all the ids from both A and B, then we get comparer.Equals(A, B) returns true; comparer.Equals(B, C) returns true, but comparer.Equals(A, C) returns false. That violates the [general equality contract](https://learn.microsoft.com/en-gb/dotnet/api/system.collections.generic.iequalitycomparer-1.equals?view=netstandard-1.6). – Jon Skeet Nov 19 '17 at 17:57
  • @poke Yes, I do not recommend a HashSet for this problem, I think I was answering a comment he made about using a HashSet for this scenario. The O(n^2) stills. So, yes, I would say the code looks like an O(nlogn) and it's confusing, specially when you forget what you did with the GetHashCode – Raudel Ravelo Nov 20 '17 at 19:39
  • @JonSkeet what would you recommend in this case? I just think that the solution would be not to use an equality comparer at all and not to use a Hash Set. Just doing an inner loop calling a function like the one I coded as the equality comparer will do the job. – Raudel Ravelo Nov 20 '17 at 19:41
  • If you don’t recommend a hash set, you should make that clear in your answer. Unfortunately, this has led OP to believe that this is the correct attempt and spawned a [follow-up question](https://stackoverflow.com/questions/47377736/c-sharp-hashset-init-takes-too-long) that was heavily discussed. In the end it seems that OP is now unfortunately convinced of using a hash set for this problem when it’s really the wrong tool… :/ – poke Nov 20 '17 at 19:54
1

The obvious source of inefficiency is the way you calculate intersection of two lists (of ids). The algorithm is O(n^2). This is by the way problem that relational databases solve for every join and your approach would be called loop join. The main efficient strategies are hash join and merge join. For your scenario the latter approach may be better I guess, but you can also try HashSets if you like.

The second source of inefficiency is repeating everything twice. As (a join b) is equal to (b join a), you do not need two cycles over the whole List<Keyword>. Actually, you only need to loop over the non duplicate ones.

Using some code from here, you can write the algorithm like:

Parallel.ForEach(list, k => k.ids.Sort());

List<Keyword> result = new List<Keyword>();

foreach (var k in list)
{
    if (result.Any(r => r.ids.IntersectSorted(k.ids, Comparer<int>.Default)
                             .Skip(5)
                             .Any()))
    {
        k.hidden = 1;
        k.live = 0;
        k.worked = true;
    }
    else
    {
        result.Add(k);
    }
}

If you replace the linq with just the index manipulation approach (see the link above), it would be a tiny bit faster I guess.

Antonín Lejsek
  • 6,003
  • 2
  • 16
  • 18