
I am comparing a list of one million objects with another list of one million objects. I am using for and foreach loops, but it takes too much time to iterate over those lists. Can anyone suggest a better way to do this?

var SourceList = new List<object>(); // one million items
var TargetList = new List<object>(); // one million items

//getting data from database here
//SourceList with List of one million 
//TargetList with List of one million

var DifferentList = new List<object>();

//ForEach
SourceList.ToList().ForEach(m =>
{
    if (!TargetList.Any(s => s.Name == m.Name))
        DifferentList.Add(m);
});

//for
for (int i = 0; i < SourceList.Count; i++)
{
    if (!TargetList.Any(s => s.Name == SourceList[i].Name))
        DifferentList.Add(SourceList[i]);
}



Sarvesh
    Post some code. It would be helpful. – user1850484 Jan 09 '20 at 11:11
  • First of all, it is not a question of C# but a general problem. Depending on your structure you may use sorted lists or binary trees... – Ugur Jan 09 '20 at 11:12
  • What is the basis for comparing objects? – Noneme Jan 09 '20 at 11:15
  • can you split the list in smaller chunks and run the your action in parallel? – Burim Hajrizaj Jan 09 '20 at 11:21
  • you say "too much time" but how much time are we actually talking about here? You really need to provide an example of your comparison logic as well, in cases like this it is usually the comparison that is the main bottleneck, or the way that you retrieve the values from the list itself, not all _lists_ are equal. – Chris Schaller Jan 09 '20 at 11:25
  • One piece of advice when asking questions on Stack Overflow: be responsive, be involved! If necessary, respond to answers you receive so answerers know they're on the right track, and *always* respond to questions or suggestions in comments. – Gert Arnold Jan 13 '20 at 15:39

4 Answers


Scanning the target list to match the name is an O(n) operation, thus your loop is O(n^2). If you build a HashSet<string> of all the distinct names in the target list, you can check whether a name exists in the set in O(1) time using the Contains method.
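
A minimal sketch of that idea, assuming a hypothetical item type with a string Name property (the question's List<object> has no Name member, so the element type is an assumption here):

var targetNames = new HashSet<string>(TargetList.Select(t => t.Name)); // one O(n) pass

var differentList = SourceList
    .Where(s => !targetNames.Contains(s.Name)) // O(1) lookup per source item
    .ToList();

Building the set is a single pass over the target list, so the whole comparison drops from O(n^2) to roughly O(n).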

Jonas Høgh

//getting data from database here

You are getting the data out of a system that specializes in matching, sorting and filtering data, and pulling it into RAM, which by itself cannot do any of that. And then you try to sort, filter and match it yourself.

That will fail. No matter how hard you try, it is extremely unlikely that your computer, with a single programmer working on a matching algorithm, will outperform a specialized database server at the one operation it is supposed to be really good at, a piece of software programmed by teams of experts and optimized for years.

You don't go into a fancy restaurant and ask them to give you huge bags of raw ingredients so you can throw them into a big bowl unpeeled and microwave them at home. No. You order a nice dish because it will be way better than anything you could do yourself.

The simple answer is: do not do that. Do not take the raw data and rummage around in it for hours. Leave that job to the database; it's the one thing it is supposed to be good at. Use its power. Write a query that gives you the result, don't fetch the raw data and then play database yourself.
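
For illustration only, here is roughly what that could look like with an ORM such as Entity Framework Core; the DbContext (db), the Source and Target entity sets and the Name column are all assumptions, since the question does not show the schema:

// Let the database compute the difference; this translates to a NOT EXISTS subquery.
var differentList = db.Source
    .Where(s => !db.Target.Any(t => t.Name == s.Name))
    .ToList();

// Roughly equivalent raw SQL:
// SELECT s.* FROM Source s
// WHERE NOT EXISTS (SELECT 1 FROM Target t WHERE t.Name = s.Name);

Only the rows that actually differ come back over the wire, instead of two million rows that then have to be compared in memory.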

nvoigt
  • In some cases (like MVVM) you may only work with model objects without accessing database even. – Ugur Jan 09 '20 at 16:05
  • There is no rule in MVVM that says you must intentionally cripple your program by not using a database for what it is supposed to be used for. You can have a method in your model that queries the database and returns a list of results. – nvoigt Jan 09 '20 at 16:37

It may seem like a bad idea, but IEnumerable magic will help you.

For starters, simplify your expression. It looks like this:

var result = sourceList.Where(s => !targetList.Any(t => t.Equals(s)));

I recommend making a comparison in the Equals method:

public class CompareObject
{
    public string prop { get; set; }

    public new bool Equals(object o)
    {
        if (o.GetType() == typeof(CompareObject))
            return this.prop == ((CompareObject)o).prop;    
        return this.GetHashCode() == o.GetHashCode();
    }
}    

Next, add AsParallel(). This can both speed up and slow down your program; in your case, you can add it like this:

var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));

The CPU is loaded to 100% if you try to list everything at once like this:

var cnt = result.Count();

But it is quite tolerable to work with if you take the results in small portions:

result.Skip(10000).Take(10000).ToList();

Full code:

static Random random = new Random();
public class CompareObject
{
    public string prop { get; private set; }

    public CompareObject()
    {
        prop = random.Next(0, 100000).ToString();
    }

    public new bool Equals(object o)
    {
        if (o.GetType() == typeof(CompareObject))
            return this.prop == ((CompareObject)o).prop;    
        return this.GetHashCode() == o.GetHashCode();
    }
}

void Main()
{
    var sourceList = new List<CompareObject>();
    var targetList = new List<CompareObject>();
    for (int i = 0; i < 10000000; i++)
    {
        sourceList.Add(new CompareObject());
        targetList.Add(new CompareObject());
    }

    var stopWatch = new Stopwatch();

    stopWatch.Start();
    var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));

    var lr = result.Skip(10000).Take(10000).ToList();
    stopWatch.Stop();

    Console.WriteLine(stopWatch.Elapsed);
}

Update

I remembered that you can use a Hashtable. Choose the unique values from targetList and from sourceList, then fill the result with the source values that are not in targetList.

Example:

static Random random = new Random();
public class CompareObject
{
    public string prop { get; private set; }
    public CompareObject()
    {
        prop = random.Next(0, 1000000).ToString();
    }

    public new int GetHashCode() {
        return prop.GetHashCode();
    }
}


void Main()
{
    var sourceList = new List<CompareObject>();
    var targetList = new List<CompareObject>();
    for (int i = 0; i < 10000000; i++)
    {
        sourceList.Add(new CompareObject());
        targetList.Add(new CompareObject());
    }

    var stopWatch = new Stopwatch();

    stopWatch.Start();
    var sourceHashtable = new Hashtable();
    var targetHashtable = new Hashtable();

    foreach (var element in targetList)
    {
        var hash = element.GetHashCode();
        if (!targetHashtable.ContainsKey(hash))
            targetHashtable.Add(element.GetHashCode(), element);
    }

    var result = new List<CompareObject>();
    foreach (var element in sourceList)
    {
        var hash = element.GetHashCode();
        if (!sourceHashtable.ContainsKey(hash))
        {
            sourceHashtable.Add(hash, element);
            if(!targetHashtable.ContainsKey(hash)) {
                result.Add(element);
            }
        }
    }

    stopWatch.Stop();

    Console.WriteLine(stopWatch.Elapsed);
}
Noneme

Foreach performs a null check before each iteration, so using a standard for loop will provide slightly better performance, and that will be hard to beat.

If it is taking too long, can you break down the collection into smaller sets and/or process them in parallel?

You could also look at PLINQ (Parallel LINQ) using .AsParallel().
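
As a rough sketch of both suggestions, assuming a hypothetical Item type with a string Name property and a name-based comparison like the one in the question (neither is shown in the original code):

// Requires System.Collections.Concurrent, System.Threading.Tasks and System.Linq.
// Hash the target names once so each lookup is O(1), then split the source list
// into index ranges and scan the ranges in parallel. The HashSet is only read
// during the parallel phase, so concurrent access is safe.
var targetNames = new HashSet<string>(TargetList.Select(t => t.Name));
var different = new ConcurrentBag<Item>();

Parallel.ForEach(Partitioner.Create(0, SourceList.Count), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        if (!targetNames.Contains(SourceList[i].Name))
            different.Add(SourceList[i]);
    }
});

Whether this beats a single-threaded HashSet lookup depends on how expensive the comparison is; for a simple string compare the parallel overhead may not pay off.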

Other areas to improve are the actual comparison logic you are using and how the data is stored in memory; depending on your problem, you may not have to load the entire object into memory for every iteration.

Please provide a code example so that we can assist further; when such large amounts of data are involved, some performance degradation is to be expected.

Again, depending on the time frame we are talking about here, you could upload the data into a database and use that for the comparison rather than trying to do it natively in C#. This type of solution is better suited to data sets that are already in a database, or where the data changes much less frequently than you need to perform the comparison.

Chris Schaller