Comparison of two large datatables is slow

Question

I have two datatables A and B with approx. 50000 rows in each. I am comparing some of the columns (not complete raw) of table B to the datatable A, using two foreach loops. The comparision takes 2~3min to complete. Is there a way to improve performance?

    foreach (DataRow entry in B.Rows)
    {
        // Compare results
        foreach (DataRow dr in A.Rows)
        {
            if (entry[1].ToString().ToUpper()== dr[2].ToString().ToUpper() && entry[2].ToString().ToUpper()== dr[3].ToString().ToUpper())
            {
               // do stuff....
              entry[1]== dr[2];
              entry[2]== dr[3];
                break;
            }
        }
    }

What kind of comparison? 2~3min is a way far away which says you're doing something wrong — Sriram Sakthivel, Feb 11 '14 at 13:21
If I were you I would do something like: create a unique hashcode for each row (using all of the contents of the row or perhaps the columns you are checking) and store it in the tables and sort the data to, then you only need to compare one column in the future, and something like a left outer join would show you the differences (the others would be the same — Paul Zahra, Feb 11 '14 at 13:29

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

What you are doing has O(n^2) complexity, which is bad performance wise. If it is bad now, it will be even worse when you add more lines to the tables.

The algorithm would be considerably faster if you could sort both the tables before. In that case you could loop through both the tables on the same time.

Another more simple option (but not as fast) is to convert the first of the tables to a dictionary instead, to allow random access to it when looping through the second table.

Edit

It should be possible to do this with LINQ instead, that way you won't have to do the heavy algorithm implementation yourself.

var matching = from br in B.Rows
join ar in A.Rows on ar[1] equals br[2]
where ar[2] == br[3]
select new { ar, br };

foreach (var match in matching)
{
  // Do stuff.
}

Could you elaborate a bit about loop through both the tables on the same time? any sample? — Jim, Feb 11 '14 at 13:26

score 3 · Accepted Answer · edited May 23 '17 at 12:20

You can use Linq-To-DataSet, Enumerable.Join is much more efficient since it's using a Dictionary/Lookup under the hood.

var join = from rowA in A.AsEnumerable()
           join rowB in B.AsEnumerable()
           on new { col1 = rowA.Field<string>(2), col2 = rowA.Field<string>(3) }
           equals new { col1 = rowB.Field<string>(1), col2 = rowB.Field<string>(2) }
           select new { rowA, rowB };
foreach (var bothRows in join)
{
    DataRow rowA = bothRows.rowA;
    DataRow rowB = bothRows.rowB;
    // do stuff....
}

Why is LINQ JOIN so much faster than linking with WHERE?

Update, acc. your comment:

I need to find and mark in the B table rows which are not in the A table. So basically on your proposed solution I would like to get join which will have like (entry1 != dr[2] || entry[2] != dr[3]))

Then you could use a combination of Enumerable.Except and Join:

var bKeyCols = B.AsEnumerable()
    .Select(r => new { col1 = r.Field<string>(1), col2 = r.Field<string>(2) });
var aKeyCols = A.AsEnumerable()
    .Select(r => new { col1 = r.Field<string>(2), col2 = r.Field<string>(3) });
var bNotInA = bKeyCols.Except(aKeyCols);
var bRowsNotInA = from row in B.AsEnumerable()
                  join keyCols in bNotInA
                  on new { col1 = row.Field<string>(1), col2 = row.Field<string>(2) } equals keyCols
                  select row;

Enumerable.Except also uses a set which makes it very efficient at the cost of some memory.

Actually this one working great... I have a small issue though I need to catch the case when (entry[1] != dr[2] || entry[2] != dr[3]), how would I implement this join? — Jim, Feb 11 '14 at 14:08
@Jim: what means that you need to handle this case, do you want to exclude these rows? — Tim Schmelter, Feb 11 '14 at 14:19
I need to find and mark in the B table rows which are not in the A table. So basically on your proposed solution I would like to get join which will have like (entry[1] != dr[2] || entry[2] != dr[3])) — Jim, Feb 11 '14 at 14:23
@Jim: try my approach above. Another way would be a "LEFT OUTER JOIN" in LINQ. http://msdn.microsoft.com/en-us/library/bb397895.aspx — Tim Schmelter, Feb 11 '14 at 14:33

score 2 · Answer 3 · answered Feb 11 '14 at 13:27

It's hard to say where is a real problem, but you may try to use DataTable built-in query possibility.

Pseudocode:

DataView dv = new DataView(A);
foreach (DataRow entry in B.Rows)
{
   dv.RowFilter = "COLUMN_A1 = " +  entry[1] + " AND COLUMN_A2=" + entry[2];
   foreach (DataRowView drv in dv) {          
        //DO SOMETHING
   }
}

where COLUMN_A1 and COLUMN_A2 are respecive column names from A table

Qr simply

first get all data you need to process in one collection
process after that collection only once.

Consider that in this case you will have increased memory pressure.

Comparison of two large datatables is slow

3 Answers3

Edit