1

I have two large CSVs to compare. Bosth csvs are basically data from the same system 1 day apart. No of rows are around 12k and columns 30.

The aim is to identify what column data has changed for primary key(#ID).

My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate csvs. One done, I again loop through the changes rows, and indetify the exact change in column.

    NewCSV = Import-Csv -Path ".\Data_A.csv"
    OldCSV = Import-Csv -Path ".\Data_B.csv"
     
   foreach ($LineNew in $NewCSV)
    {
        ForEach ($LineOld in $OldCSV)
        {
            If($LineNew -eq $LineOld)
            {
                Write-Host $LineNew, " Match"
            }else{
                Write-Host $LineNew, " Not Match"
            }
        }
    }

But as soon as run the loop, it takes forever to run for 12k rows. I was hoping there must be a more efficient way to compare large files powershell. Something that is quicker.

misguided
  • 3,699
  • 21
  • 54
  • 96
  • You only care about _"what has changed"_ on `$NewCsv` from `$OldCsv` or side-by-side comparison? Also, does `ID` have unique values on both CSVs? – Santiago Squarzon Dec 17 '21 at 13:32
  • I want to know both the old value and new value for changed columns for a particular ID – misguided Dec 17 '21 at 13:34
  • What about **Compare-Object (Get-content Data_A.csv) (Get-Content Data_B.csv)** ? – Sergiu Dec 17 '21 at 13:39
  • 1
    Why it takes forever: by comparing each row in oldcsv against each row of newcsv makes 12k*12k comparisons, so about 144 million operations. – vonPryz Dec 17 '21 at 13:43
  • This is a quiet classic issue. Using this [`Join-Object script`](https://www.powershellgallery.com/packages/Join)/[`Join-Object Module`](https://www.powershellgallery.com/packages/JoinModule) (see also: [In Powershell, what's the best way to join two tables into one?](https://stackoverflow.com/a/45483110/1701026)): `Import-Csv .\Data_A.csv |Join (Import-Csv .\Data_B.csv) -on ID -Name 'A.', 'B.'` – iRon Dec 17 '21 at 16:16

2 Answers2

1

Well you can give this a try, I'm not claiming it will be fast for what vonPryz has already pointed out but it should give you a good side-by-side perspective to compare what has changed from OldCsv to NewCsv.

Note: Those cells that have the same value on both CSVs will be ignored.

$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString

$properties = $newCsv[0].PSObject.Properties.Name

$result = foreach($line in $NewCSV)
{
    if($ref = $OldCSV[$line.ID])
    {
        foreach($prop in $properties)
        {
            if($line.$prop -ne $ref.$prop)
            {
                [pscustomobject]@{
                    ID = $line.ID
                    Property = $prop
                    OldValue = $ref.$prop
                    NewValue = $line.$prop
                }
            }
        }
        continue
    }

    Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}
Santiago Squarzon
  • 41,465
  • 5
  • 14
  • 37
  • This solution is much faster that the one I was trying myself. One issue though, it randomly read some values as `System.Object[]` and hence shows data mismatch sometime. – misguided Dec 18 '21 at 04:31
  • @misguided without looking at the actual CSVs I don't see how that could be possible – Santiago Squarzon Dec 18 '21 at 04:43
  • 1
    The only possible way I could see it returning `Object[]` is if the OldCsv or NewCsv did not have Unique Values on the ID Column, which you already mentioned values were unique, if this was not the case then it should've been clarified. – Santiago Squarzon Dec 18 '21 at 04:54
  • Awesome...that was the cause. I went back and dug a bit deeper into data and sawa few duplicates which as you say were causing the issue. I have removed them now and it works fine. Less than a minute for comparing 2 csv files with 12k rows each. – misguided Dec 18 '21 at 05:12
  • @misguided good to know, glad it worked :) – Santiago Squarzon Dec 18 '21 at 05:19
1

As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation) - every time the input size doubles, the number of computations performed increase 4-fold.

To avoid this, I'd suggest using a hashtable or other dictionary type to hold each data set, and use the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes near-linear (O(2n + k)):

$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" |ForEach-Object {
  $NewCSV[$_.ID] = $_
}

$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" |ForEach-Object {
  $OldCSV[$_.ID] = $_
}

Now that we can efficiently resolve each row by it's ID, we can inspect the whole of the data sets with an independent loop over each:

foreach($entry in $NewCSV.GetEnumerator()){
  if(-not $OldCSV.ContainsKey($entry.Key)){
    # $entry.Value is a new row, not seen in the old data set
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # do the individual comparison of the rows here
}

Do another loop like above, but with $NewCSV in place of $OldCSV to find/detect deletions.

Mathias R. Jessen
  • 157,619
  • 12
  • 148
  • 206
  • You cannot call a method on a null-valued expression. At C:\Data\FIle.PS1:39 char:19 + foreach($entry in $NewCSV.GetEnumerator()){ + ~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (:) [], RuntimeException + FullyQualifiedErrorId : InvokeMethodOnNull – misguided Dec 17 '21 at 14:19
  • @misguided Did you run the second block of code without running the first block of code in my answer? :) First line, `$NewCSV = @{}`, assigns a hashtable to `$NewCSV`, so if you run them in order, that error won't occur – Mathias R. Jessen Dec 17 '21 at 14:22
  • I did actually. But still getting the error. The data set has some null values in some columns(not primary key). Could that be an issue? – misguided Dec 17 '21 at 14:25
  • No, `Import-Csv` would spit out empty strings rather than null values. Also, the `$NewCSV` variable still _exists_ regardless of whether the CSV is empty or not. – Mathias R. Jessen Dec 17 '21 at 14:27
  • What does this do ? ForEach-Object {$NewCSV[$_.ID] = $_} ??? Normal import csv loads data fine. When I pipe it with this code, the array is blank. I am assuming ID is the "primary Key" column name ? – misguided Dec 18 '21 at 01:03
  • @misguided Exactly, you'd have to replace `ID` with the name of the column contain the primary key for this to work – Mathias R. Jessen Dec 18 '21 at 01:54