0

I have an Item object with a property called generator_list (hashset of strings). I have 8000 objects, and for each object, I'd like to see how it's generator_list intersects with every other generator_list, and then I'd like to store the intersection number in a List<int>, which will have 8000 elements, logically.

The process takes about 8 minutes, but only a few minutes with parallel processing, but I don't think I'm doing the parallel part right, hence the question. Can anyone please tell me if and how I need to modify my code to take advantage of the parallel loops?

The code for my Item object is:

public class Item
{
    public int index { get; set; }
    public HashSet<string> generator_list = new HashSet<string>();
}

I stored all my Item objects in a List<Item> items (8000 elements). I created a method that takes in items (the list I want to compare) and 1 Item (what I want to compare to), and it's like this:

public void Relatedness2(List<Item> compare, Item compare_to)
        {
            int compare_to_length = compare_to.generator_list.Count;
            foreach (Item block in compare)
            {
                int block_length = block.generator_list.Count;
                int both = 0; //this counts the intersection number
                if (compare_to_length < block_length) //to make sure I'm looping  
                                                      //over the smaller set
                {
                    foreach (string word in compare_to.generator_list)
                    {
                        if (block.generator_list.Contains(word))
                        {
                            both = both + 1;
                        }
                    }
                }
                else
                {
                    foreach (string word in block.generator_list)
                    {
                        if (compare_to.generator_list.Contains(word))
                        {
                            both = both + 1;
                        }
                    }
                }
                     // I'd like to store the intersection number, both,   
                     // somewhere so I can effectively use parallel loops
            }

        }

And finally, my Parallel forloop is:

Parallel.ForEach(items, (kk, state, index) => Relatedness2(items, kk));

Any suggestions?

thatandrey
  • 287
  • 5
  • 16
  • What happens when you run it? – Chuck Buford Sep 16 '14 at 19:29
  • You should look into using a concurrent collection object, such as this one [ConcurrentDictionary(Of TKey, TValue)](http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx), unless you want to implement one yourself. Also see: [Concurrent HashSet in .NET Framework](http://stackoverflow.com/questions/18922985/concurrent-hashsett-in-net-framework). – Victor Zakharov Sep 16 '14 at 20:16
  • Hi, when I actually run it, it works, and I used to store a List results for each Item in a property for that item – thatandrey Sep 16 '14 at 20:38
  • Sorry, I edited a line in the parallel code because I had to modify my code before I pasted it here, to make it more readable – thatandrey Sep 16 '14 at 20:43
  • In other words, the parallel code line must've looked weird because I was using a variable that I didn't mention elsewhere, but that was a result of just copying and pasting my code and trying to clean it up, I fixed it now – thatandrey Sep 16 '14 at 20:49

3 Answers3

2

Maybe something like this

 public Dictionary<int, int> Relatedness2(IList<Item> compare, Item compare_to)
        {
            int compare_to_length = compare_to.generator_list.Count;
            var intersectionData = new Dictionary<int, int>();
            foreach (Item block in compare)
            {
                int block_length = block.generator_list.Count;
                int both = 0;
                if (compare_to_length < block_length)
                {
                    foreach (string word in compare_to.generator_list)
                    {
                        if (block.generator_list.Contains(word))
                        {
                            both = both + 1;
                        }
                    }
                }
                else
                {
                    foreach (string word in block.generator_list)
                    {
                        if (compare_to.generator_list.Contains(word))
                        {
                            both = both + 1;
                        }
                    }
                }
                intersectionData[block.index] = both;
            }
            return intersectionData;
        }

And

          List<Item> items = new List<Item>(8000);
        //add to list
        var dictionary = new ConcurrentDictionary<int, Dictionary<int, int>>();//thread-safe dictionary

        var readOnlyItems = items.AsReadOnly();// if you sure you wouldn't modify collection, feel free use items directly
        Parallel.ForEach(readOnlyItems, item =>
        {
            dictionary[item.index] = Relatedness2(readOnlyItems, item);
        });

I assumed that index unique.

i used a dictionaries, but you may want to use your own classes in my example you can access data in following manner

var intesectiondata = dictionary[1]//dictionary of intersection for item with index 1

var countOfintersectionItemIndex1AndItemIndex3 = dictionary[1][3]
var countOfintersectionItemIndex3AndItemIndex7 = dictionary[3][7]

don't forget about possibility dictionary[i] == null

Roman G.
  • 182
  • 12
  • Thanks, I'll try this. My only question, is that for each Item, since it's being intersected with every other item, I'd like to store all the results in one List for each item. So it'll look like 7, {23,34,0,0,45,100....} and that would mean that Item 7 (based on index, which is unique), has an intersection of 23 with Item 0, 34 with Item 1,... and so on. Do you know how I can change your code to implement this? – thatandrey Sep 16 '14 at 20:40
  • Oh wait, I just ran the code, and it was BLAZING fast. I'm going to try to confirm the results, and I think your dictionary actually does give me the values I want, I was just misunderstanding your code. Sorry, I only started programming this summer! – thatandrey Sep 16 '14 at 20:47
  • Do you think you could update your answer and add this code? I can't seem to follow it that well in the comments :( – thatandrey Sep 16 '14 at 20:52
  • Actually, I had one more question. I tried this on a subset of my Items (with just 1000 Items), and it works, but the Dictionary is only giving me the first 1000 intersections? Before, with each iteration of Relatedness2, it was finding the intersection of an Item with the rest of the items, but now it seems to only be doing it for the first 1000? **** actually I think it was just me defining my list incorrectly, and I was passing the shorter list as the one to compare all the items to -_- – thatandrey Sep 16 '14 at 21:03
  • So, since, I can't run all 8000 (actually more like 8700) Item calculations at the same time without a memory exception, do you know how I can modify the code to just do 1000 at a time? – thatandrey Sep 16 '14 at 21:07
  • 1
    Parallel.ForEach(readOnlyItems.Take(1000).ToList(), item => { dictionary[item.index] = Relatedness2(readOnlyItems, item); }); this will get you All intersections for first 1000 items of list – Roman G. Sep 16 '14 at 21:19
  • Thanks, but for some reason, this actually takes the same amount of time as before :/ do you know why? That seems weird, since previously, I was accessing and modifying my items too... hmm – thatandrey Sep 16 '14 at 21:21
  • there may be many reasons... maybe simply less available processor time. And make sure you don't modify items in Relatedness2 method, that maybe make results incorrect. – Roman G. Sep 16 '14 at 21:29
0

Thread safe collections is probably what you are looking for http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx.

When working in multithreaded environment, you need to make sure that you are not manipulating shared data at the same time without synchronizing access.

the .NET Framework offers some collection classes that are created specifically for use in concurrent environments, which is what you have when you're using multithreading. These collections are thread-safe, which means that they internally use synchronization to make sure that they can be accessed by multiple threads at the same time.

Source: Programming in C# Exam Ref 70-483, Objective 1.1: Implement multhitreading and asynchronous processing, Using Concurrent collections

Which are the following collections

  • BlockingCollection<T>
  • ConcurrentBag<T>
  • ConcurrentDictionary<T>
  • ConcurentQueue<T>
  • ConcurentStack<T>
Ken Spur
  • 41
  • 3
  • 1
    Thanks, but since I only started programming this summer, this kind of information flies over my head lol. But I think I get the point, especially since one of the answers used a concurrent Dictionary, and it's beginning to shed light on the whole topic. – thatandrey Sep 16 '14 at 20:53
0

If your Item's index is contiguous and starts at 0, you don't need the Item class at all. Just use a List< HashSet< < string>>, it'll take care of indexes for you. This solution finds the intersect count between 1 item and the others in a parallel LINQ. It then takes that and runs it on all items of your collection in another parallel LINQ. Like so

var items = new List<HashSet<string>>
{
    new HashSet<string> {"1", "2"},
    new HashSet<string> {"2", "3"},
    new HashSet<string> {"3", "4"},
    new HashSet<string>{"1", "4"}
};


var intersects = items.AsParallel().Select(     //Outer loop to run on all items
    item => items.AsParallel().Select(          //Inner loop to calculate intersects
            item2 => item.Intersect(item2).Count())
            //This ToList will create a single List<int>
            //with the intersects for that item
            .ToList() 
        //This ToList will create the final List<List<int>>
        //that contains all intersects.
        ).ToList();
Guillaume CR
  • 3,006
  • 1
  • 19
  • 31
  • Thanks, I'll definitely try this out too! I only had a chance to try the other answer's code and I have classes all day ugh, but once I'm free, I'll try implementing this as well. – thatandrey Sep 16 '14 at 20:50
  • Hi, I'm curious, what's the point of `newItem` ? I understand items, and I'd like to go ahead and try this, but I don't have the equivalent of `newItem` in my code? – thatandrey Sep 16 '14 at 21:38
  • @user3408097 Maybe I'm misunderstanding your code. I thought you wanted to compare one item to your list of items and get the number of intersects for each existing item in the list compared to that new item. Looks like you already found your answer though. – Guillaume CR Sep 17 '14 at 02:14
  • Yeah, I sort of did, and when I ran it with a subset of the data, it was super fast, but with all the data, it's just as fast as what I had before... Yeah, now that I see your code better, I sort of understand, but can you please show me how to change it so it'll automatically loop through all the Items in the list? It seems like your code is only doing it for one Item? – thatandrey Sep 17 '14 at 04:46
  • @user3408097 Updated the answer, let me know about the performance. – Guillaume CR Sep 17 '14 at 13:58
  • With your method, it seemed much slower. On the outer loop, I did a subset of 500 items and it took more than a minute. When I replaced the intersect function with the method I had in my original post, it dropped to 10 seconds, which is about the time I had, and the time the other answer had too :/ – thatandrey Sep 17 '14 at 18:10
  • @user3408097 That is disappointing to hear. I'm wondering if I overused .AsParrallel(). Maybe take it out of the inner loop? Or maybe LINQ just isn't very fast, which would be unfortunate. – Guillaume CR Sep 17 '14 at 18:12