
Edit: I have changed the title of this question from "PowerShell: Why is this timing not working?". I originally thought the times reported had to be wrong, but they were correct; what I learned from the discussion of the question was why the times were so different. The new title better describes what can be learned from this Q&A.


I'm writing a script to compare the contents of two folders, including a binary comparison of files whose size and timestamp are the same. I want to monitor how quickly it does the comparisons, but my results are way out of whack.

Here is an excerpt of my code that just tests the monitoring of the comparison speed.

$sFolder_1 = "<path of folder 1, including final \>"
$sFolder_2 = "<path of folder 2, including final \>"
$nLen_1    = $sFolder_1.Length   # prefix length, used below to get each item's path relative to folder 1

get-ChildItem -path $sFolder_1 -Recurse | ForEach-Object `
   {$oItem_1   = $_
    $sItem_1   = $oItem_1.FullName
    $sItem_rel = $sItem_1.Substring($nLen_1)
    $sItem_2   = join-path $sFolder_2 $sItem_rel
    if(Test-Path -Type Container $sItem_1) {$sFile = ""} else {$sFile = "F"}

    # Check for corresponding item in folder 2:
    if (-not (Test-Path $sItem_2)) `
       {$sResult = "Not in 2"}
      else
        # If it's a file, compare in both folders:
       {if ($sFile -eq "") `
           {$sResult = "Found"}
          else
           {$nSize_1 = $oItem_1.Length
            $dTimeStart = $(get-date)
            $nKb = ($nSize_1 / 1024)
            Write-Output "$dTimeStart : Checking file ($nKb kb)"
            if (Compare-Object (Get-Content $sItem_1) (Get-Content $sItem_2)) `
               {$sResult = "Dif content"}
              else
               {$sResult = "Same"}
            $nTimeElapsed = ($(get-date) - $dTimeStart).Ticks / 1e7
            $nSpeed = $nKb / $nTimeElapsed
            Write-Output "$nKb kb in $nTimeElapsed seconds, speed $nSpeed kb/sec."
        }   }
    Write-Output $sResult
    }

Here is the output from running that on a particular pair of folders. The four files in the two folders are all "gvi" files, a type of video file.

08/05/2023 08:58:41 : Checking file (75402.453125 kb)
75402.453125 kb in 37.389018 seconds, speed 2016.70054894194 kb/sec.
Same
08/05/2023 08:59:18 : Checking file (67386.28515625 kb)
67386.28515625 kb in 22.6866484 seconds, speed 2970.30588071573 kb/sec.
Same
08/05/2023 08:59:41 : Checking file (165559.28125 kb)
165559.28125 kb in 5.6360258 seconds, speed 29375.1815774158 kb/sec.
Same
08/05/2023 08:59:47 : Checking file (57776.244140625 kb)
57776.244140625 kb in 2.059942 seconds, speed 28047.5101437929 kb/sec.
Same

This says that the comparison ran ten times faster on the third and fourth files than on the first two. That doesn't make sense. I'm guessing that there's something about the way PowerShell is optimizing the process that is causing the difference. Is there a way to find out the actual time spent doing each comparison?

  • The file may be in cache so it runs faster than opening the file. – jdweng Aug 05 '23 at 15:35
  • How many *lines* of text are there in each file? I imagine that’s a significant factor in the time required for ```Compare-Object``` to run - e.g files with 100,000 lines of 50 characters would be a lot slower than files with 1 line of 5,000,000 characters even though the total size is the same since there’s combinatorial explosion in the number of comparisons required. (If you just want to compare the full file contents you can use ```Get-Content … -Raw```, although low-level dotnet io methods would be *even* faster to load the file contents) – mclayton Aug 05 '23 at 16:27
  • @mclayton - There are no lines of text. As I said, these are video files. – NewSites Aug 05 '23 at 17:32
  • @mclayton - What are low-level dotnet io methods? Can I use them in PowerShell? – NewSites Aug 05 '23 at 17:35
  • @NewSites - they might be binary files, but ```Get-Content``` will still split on line break characters and return an *array* of strings. The more line break characters in the file, the more items in the array returned by ```Get-Content``` - see the docs at https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content?view=powershell-7.3 where it says “For files, the content is read one line at a time and returns a collection of objects, each representing a line of content.” – mclayton Aug 05 '23 at 17:39
  • @jdweng - I watched a bit of the first video a few days ago and have not watched any of the others in a long time. So if any of these files is in cache, it would be the first one, not the third and fourth. – NewSites Aug 05 '23 at 17:39
  • @mclayton - Well, interesting. I opened the four files in Notepad++, counted `\r`s and `\n`s. Results: File 1: `\r` 272,652, `\n` 291,178, File 2: `\r` 189,941, `\n` 197,111, File 3: `\r` 398, `\n` 721, File 4: `\r` 2,916, `\n` 3,130. Astonishing! I know nothing of gvi format, don't know what `\r` and `\n` are doing there, but how can there be only 10^2 in longest video and 10^6 in two shorter ones? Anyway, this is consistent with your predicted correlation between speed and count of newlines. (File 3 should be fastest, but it's also the largest file, so that would slow it down.) Amazing. – NewSites Aug 05 '23 at 18:15
  • @mclayton - See my answer based on your comments, and consider adding your own answer based on those comments. Thank you for the very helpful information! – NewSites Aug 05 '23 at 18:59

2 Answers


If we rephrase your question to:

  • Why are the timings not meeting my expectations?

Then the answer is easy:

  • Because your expectations are wrong :-)

You're expecting the time taken to vary with the size of the file, but the way your code reads the files means a significant part of the run time actually depends on the number of line break character sequences in the files!


So there are a couple of issues with your approach:

Problem 1: With binary data files, the expressions (Get-Content $sItem_1) and (Get-Content $sItem_2) basically retrieve mangled arrays of stringified binary data, where the number of items is determined by the number of "line-break-like" sequences in the binary content.

  • Get-Content is primarily meant for use with text-based files - by default it will decode and split a file into lines of text based on line break sequences it finds in the file - see Get-Content:

the content is read one line at a time and returns a collection of objects, each representing a line of content.

This means that any byte sequences in the binary file that happen to look like line breaks will be treated as line breaks regardless of their meaning in the binary file's native format. The number of strings in the array returned by Get-Content will correlate with the number of accidental line break sequences in the binary file.

In your sample data this ranges from hundreds of items through to hundreds of thousands of items, and doesn't really seem to relate to the size of the file.
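
You can see this for yourself by counting the items Get-Content returns for one of the files - a minimal sketch, reusing $sItem_1 from the question:

# How many "lines" does Get-Content find in a binary file?
# The item count tracks the number of accidental line break
# sequences, not the file size.
$lines = Get-Content $sItem_1
"File size : $((Get-Item $sItem_1).Length) bytes"
"Item count: $($lines.Count)"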


Problem 2: Performance of Compare-Object correlates loosely with the number of items in the input collections.

  • When called with arrays of objects as inputs, Compare-Object attempts to pair up equal values from the left and right sides and returns any items for which it fails to find a partner. The time taken grows with the number of input items, and the growth can be combinatorially explosive with the wrong data.
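
To get a feel for this scaling before looking at the real files, you could time Compare-Object against synthetic arrays of increasing size - a rough illustration (the sizes are arbitrary choices):

# Rough illustration: Compare-Object cost grows with the item count.
# Each pair of arrays holds the same values in reversed order, so
# Compare-Object finds a partner for every item and returns nothing,
# but it still has to do all the matching work.
foreach ($n in 10000, 20000, 40000)
{
    $left  = 1..$n
    $right = $n..1
    $t = Measure-Command { Compare-Object $left $right }
    "{0,6} items: {1:n2} seconds" -f $n, $t.TotalSeconds
}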

From your own test data you can see the number of "line breaks" correlates with the processing time:

| File   | Size   | CR Count | LF Count | Time | KB/s   |
|--------|--------|----------|----------|------|--------|
| File 1 | 75 MB  | 272,652  | 291,178  | 37s  | 2,016  |
| File 2 | 67 MB  | 189,941  | 197,111  | 22s  | 2,970  |
| File 3 | 165 MB | 398      | 721      | 5s   | 29,375 |
| File 4 | 57 MB  | 2,916    | 3,130    | 2s   | 28,047 |
  • It also compares items exhaustively and returns all differences, but you only really care whether there are zero differences or more than zero. If the first byte of both files differs you could exit the check without reading the remaining hundreds of megabytes of file contents...

Possible fixes:

Get-Content -Raw

One option is to use the -Raw switch on Get-Content which forces it to read the entire file contents into a single string and ignore line breaks.

You don't really need to use Compare-Object if you do this - you can just do a simple string comparison:

if ((Get-Content $sItem_1 -Raw) -eq (Get-Content $sItem_2 -Raw))

However, you're still creating mangled stringified representations of the binary data, which isn't ideal, and you're still processing the whole file even if the first byte is different.

Get-Content -AsByteStream

Another option is to use the -AsByteStream switch on Get-Content - this will return an array of bytes instead of a string, but you'll need to modify the call to Compare-Object as well:

Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream))

Note the return value from Get-Content is wrapped in an outer array @(, ... ) - this forces Compare-Object to compare the two arrays as ordered lists, rather than as sets of values. See the two examples below:

# nothing returned because the arrays are treated as *sets* with 2 matching items in, not an ordered list
PS> Compare-Object @(0, 1) @(1, 0)

# inputs are treated as containing a single ordered-list item, and the lists are not the same
PS> Compare-Object @(,@(0, 1)) @(,@(1, 0))

InputObject SideIndicator
----------- -------------
{1, 0}      =>
{0, 1}      <=

In this case you could do:

if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream)) )

... although this still reads the whole file even if the first byte is different.


Update - as suggested by @mklement0, using -Raw as well as -AsByteStream will improve performance, as the entire file contents are returned as a single byte array rather than as a drip-fed pipeline of individual bytes that have to be collected into an array anyway.

The updated code would look like:

if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream -Raw)) @(,(Get-Content $sItem_2 -AsByteStream -Raw)) )

Get-FileHash

You could also take a completely different approach and compare the hashes of the files with Get-FileHash. That's presumably optimised to be memory-efficient (e.g. not storing the whole file in memory at once) and properly treats the binary data as binary data.
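
A minimal sketch of the hash-based check, reusing the variables from the question (Get-FileHash defaults to the SHA256 algorithm):

# Compare the two files via their hashes instead of their raw contents.
$hash_1 = (Get-FileHash $sItem_1).Hash
$hash_2 = (Get-FileHash $sItem_2).Hash
if ($hash_1 -eq $hash_2) { $sResult = "Same" } else { $sResult = "Dif content" }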

As with the other two approaches though, it will still process the whole file before comparing the hashes. To fix that you might need to drop down to native dotnet methods, but this answer is already pretty long, so you could maybe search for "compare binary files in powershell" to research that...

  • Nicely done. Note that you can combine `-AsByteStream` with `-Raw`, which directly returns a `[byte[]]` array, and is much faster than streaming bytes one by one without `-Raw`. – mklement0 Aug 05 '23 at 22:16
  • @mklement0 - I’ll edit that into my answer next time I get a moment :-) – mclayton Aug 05 '23 at 22:39
  • I did the search you suggested, and found this code for buffered comparison in PS: https://stackoverflow.com/questions/19990788/powershell-binary-file-comparison#22800663 . It's pretty simple and looks good. What do you think of it? – NewSites Aug 06 '23 at 07:53
  • @NewSites - the OP claims that function works, so I guess you could test and see if you’re happy with it. I’ve got 2 observations though: (i) could do with a ```try … catch``` to ensure streams are closed if there are any exceptions (ii) I’ve debated this (inconclusively) with @mklement0 before, but ```Read``` doesn’t *guarantee* to return the full amount of requested bytes, although it *normally* does. The code you linked doesn’t account for that detail, but *probably* still works fine. The documentation for ```Read``` shows… – mclayton Aug 06 '23 at 08:31
  • … an example where it checks the return value for the number of bytes read - see https://learn.microsoft.com/en-us/dotnet/api/system.io.filestream.read?view=net-7.0#system-io-filestream-read(system-byte()-system-int32-system-int32). You could add an assert to check the return matches the requested count, to at least throw if it’s different, but other than that, try it and see :-). – mclayton Aug 06 '23 at 08:36
  • @mclayton - I've posted a follow-up Q&A in which I report on the speeds of various methods of comparison, including what I learned from you here. Would be interested in your thoughts on it: https://stackoverflow.com/questions/76895989/speed-of-binary-file-comparisons-in-powershell/ – NewSites Aug 14 '23 at 03:04
  • @mklement0 - You're also obviously knowledgeable on this subject, so would be interested in your thoughts on that speed comparison as well: https://stackoverflow.com/questions/76895989/speed-of-binary-file-comparisons-in-powershell/ – NewSites Aug 14 '23 at 03:06

I want to give credit for this answer to @mclayton, who guided me to it with his comments on the question. If @mclayton will repeat the information from his comments in an answer, I'll accept that answer.

It turns out that the timing reported by the script was correct. The cause of the difference in the comparison times was the number of newline characters in the files. I know nothing of the gvi format and don't know what \r and \n would be doing there, but following on what @mclayton said in his comments, I opened the four video files in Notepad++ and counted \rs and \ns. Results:

  • File 1: \r 272,652, \n 291,178
  • File 2: \r 189,941, \n 197,111
  • File 3: \r 398, \n 721
  • File 4: \r 2,916, \n 3,130.

Astonishing! How can there be only a few hundred of these characters in the largest video file, but hundreds of thousands in two of the shorter ones? However little sense this made, it was consistent with @mclayton's predicted correlation between speed and count of newlines. (File 3 should be fastest by that criterion, but it's also the largest file, so that would slow it down.)
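
(As an aside, the same counts can be produced in PowerShell instead of Notepad++ - a small sketch, assuming the file fits in memory:)

# Count CR (0x0D) and LF (0x0A) bytes in a file.
$bytes = [System.IO.File]::ReadAllBytes($sItem_1)
"CR: $(($bytes -eq 13).Count), LF: $(($bytes -eq 10).Count)"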

@mclayton also said I could overcome the slow-down caused by those newlines by using the -raw parameter in get-content. So I tried that. I modified the code in the question by changing the line:

if (Compare-Object (Get-Content $sItem_1) (Get-Content $sItem_2)) `

to

if (Compare-Object (Get-Content $sItem_1 -raw) (Get-Content $sItem_2 -raw)) `

New output:

08/05/2023 13:31:00 : Checking file (75402.453125 kb)
75402.453125 kb in 2.1687595 seconds, speed 34767.5494332129 kb/sec.
Same
08/05/2023 13:31:02 : Checking file (67386.28515625 kb)
67386.28515625 kb in 2.2484505 seconds, speed 29970.0994779516 kb/sec.
Same
08/05/2023 13:31:04 : Checking file (165559.28125 kb)
165559.28125 kb in 6.2925156 seconds, speed 26310.5078754195 kb/sec.
Same
08/05/2023 13:31:11 : Checking file (57776.244140625 kb)
57776.244140625 kb in 2.0886595 seconds, speed 27661.8779368418 kb/sec.
Same

So the information from @mclayton has not only explained the speed difference, but has allowed me to speed up the comparisons for some files by about ten times!


Edit:

The code with that change worked fine until it came to a 3.7 GB file. Then it crashed with an OutOfMemoryException, so I need to look more into the alternatives discussed in the answer by @mclayton.
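
In the meantime, here is a rough sketch of the kind of chunked comparison @mclayton's comments point towards: it reads both files in fixed-size blocks and stops at the first difference, so it never holds more than two small buffers in memory. The function name and the 64 KB buffer size are my own choices, not taken from any linked answer:

# Chunked binary comparison that stops at the first difference.
# try/finally ensures the streams are closed even on exceptions.
# Note: Read() does not guarantee to fill the buffer; this sketch
# assumes both streams return the same count per call, which
# FileStream normally does (see @mclayton's comments above).
function Test-FileContentEqual([string]$Path1, [string]$Path2)
{
    if ((Get-Item $Path1).Length -ne (Get-Item $Path2).Length) { return $false }
    $fs1 = [System.IO.File]::OpenRead($Path1)
    $fs2 = [System.IO.File]::OpenRead($Path2)
    try
    {
        $buf1 = New-Object byte[] 65536
        $buf2 = New-Object byte[] 65536
        while ($true)
        {
            $n1 = $fs1.Read($buf1, 0, $buf1.Length)
            $n2 = $fs2.Read($buf2, 0, $buf2.Length)
            if ($n1 -ne $n2) { return $false }
            if ($n1 -eq 0)   { return $true }    # both files at end-of-file
            for ($i = 0; $i -lt $n1; $i++)
            {
                if ($buf1[$i] -ne $buf2[$i]) { return $false }
            }
        }
    }
    finally
    {
        $fs1.Dispose()
        $fs2.Dispose()
    }
}

Test-FileContentEqual $sItem_1 $sItem_2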

  • @mclayton - I'd still like to know what you meant by "low-level dotnet io methods" in one of your comments on the question, and whether I can use them in PowerShell. – NewSites Aug 05 '23 at 19:00
  • In retrospect, using ```Get-Content``` *at all* is probably the wrong tool for the job as you're treating a binary file as a text file - when ```Get-Content``` reads the file it will convert individual bytes into UTF-16 characters which will take double the space in memory. Also, for very large files, you'll end up reading the whole file into memory even if the **first** byte is different. You might be better off searching for something like "compare binary files in powershell" for a better approach that works by reading smaller *chunks* of *bytes* from the file and comparing those... – mclayton Aug 05 '23 at 19:03
  • I should maybe have said *native* dotnet methods rather than *low-level* - e.g. https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readallbytes?view=net-7.0 which would be something like ```[System.IO.File]::ReadAllBytes($sItem_1)``` in PowerShell. – mclayton Aug 05 '23 at 21:20
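
Following up on that last comment, a minimal sketch of the ReadAllBytes approach might look like this (SequenceEqual is a standard .NET LINQ method; like -Raw, this loads both files fully, so it only suits files that fit in memory):

# Compare two files as whole byte arrays via .NET methods.
$bytes_1 = [System.IO.File]::ReadAllBytes($sItem_1)
$bytes_2 = [System.IO.File]::ReadAllBytes($sItem_2)
if ([System.Linq.Enumerable]::SequenceEqual($bytes_1, $bytes_2))
    {$sResult = "Same"}
  else
    {$sResult = "Dif content"}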
    I should maybe have said *native* dotnet methods rather than *low-level* - e.g. https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readallbytes?view=net-7.0 which would be something like ```[System.IO.File]::ReadAlllBytes($sItem_1)``` in PowerShell. – mclayton Aug 05 '23 at 21:20