
I have a script that checks whether a URL from file 1 exists in file 2 and, if not, writes it to the output file. It works fine; here it is:

```powershell
Write-Host "Script output will have unique items from file 1"
$FirstPath = Read-Host -Prompt "Input file location of first .csv file"
$SecondPath = Read-Host -Prompt "Input file location of second .csv file"

Write-Host "Importing CSV files..."
$FirstFile = Import-Csv $FirstPath -Delimiter ';' |
             Select-Object -ExpandProperty Url
$SecondFile = Import-Csv $SecondPath -Delimiter ';' |
              Select-Object -ExpandProperty ITEM_TARGET_URI

Write-Host "Comparing files..."
Compare-Object -ReferenceObject $FirstFile -DifferenceObject $SecondFile -PassThru |
    Where-Object { $_.SideIndicator -eq "<=" } |
    Out-File -Encoding Utf8 .\result.txt

Write-Host "Done, press any key to continue..."
$x = $Host.UI.RawUI.ReadKey("NoEcho,IncludeKeyDown")
```

My issue is that when working on large CSV files (one has, for example, 4,000,000 records), the script runs all night and still isn't finished. I also can't see whether any progress is being made. I'd like to make it faster, or at least get some information about its progress. I've read about progress bars, but that won't work here, since the comparison happens in a single pipeline statement, not in a loop.

What can I change in my script to make it work faster, and/or be able to see progress?

Edit: the problem is different from the suggested thread; I'm mainly focused on working with large files and improving the speed of my script. The solution suggested there does not answer this problem.

Wolwgang
  • Have you tried implementing this in another language? Something like Python would give you still that high-level experience, but you would be able to multi-thread it and give output where you need. -- I say this because I have my doubts about Powershell running this command in multiple threads – David Culbreth Aug 23 '17 at 13:28
  • @DavidCulbreth I have not tried it yet, since it was working fine on smaller files. But I will give it a try if no one has any idea how I can modify this script. – Wolwgang Aug 23 '17 at 13:31
  • Possible duplicate of [Comparing two arrays & get the values which are not common](https://stackoverflow.com/questions/6368386/comparing-two-arrays-get-the-values-which-are-not-common) – iRon Aug 23 '17 at 16:35

2 Answers


Since you apparently want only the URLs from the first file that are not present in the second file you may want to try something like this:

```powershell
$ref = Import-Csv $SecondPath -Delimiter ';' |
       Select-Object -Expand ITEM_TARGET_URI

Import-Csv $FirstPath -Delimiter ';' |
    Select-Object -Expand Url |
    Where-Object { $ref -notcontains $_ } |
    Out-File -Encoding UTF8 .\result.txt
```

This is assuming that Compare-Object is your actual bottleneck. Did you verify that (by timing both import statements and the compare statement)?
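To check where the time actually goes, you can wrap each stage in `Measure-Command` (a sketch using the paths from the question; note that variables assigned inside a `Measure-Command` script block may not be visible afterwards, so this measures only):

```powershell
# Time the two CSV imports and the comparison separately
$tImport1 = Measure-Command {
    $FirstFile = Import-Csv $FirstPath -Delimiter ';' |
                 Select-Object -ExpandProperty Url
}
$tImport2 = Measure-Command {
    $SecondFile = Import-Csv $SecondPath -Delimiter ';' |
                  Select-Object -ExpandProperty ITEM_TARGET_URI
}
$tCompare = Measure-Command {
    Compare-Object -ReferenceObject $FirstFile -DifferenceObject $SecondFile |
        Where-Object { $_.SideIndicator -eq '<=' }
}
"Import 1: $($tImport1.TotalSeconds)s, Import 2: $($tImport2.TotalSeconds)s, Compare: $($tCompare.TotalSeconds)s"
```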

If the comparison operation is not the major bottleneck (e.g. data import takes equally long or longer), you may want to replace Import-Csv with something like

```powershell
$ref = Get-Content $SecondPath |
       Select-Object -Skip 1 |
       ForEach-Object { $_.Split(';')[5] }
```

or like this:

```powershell
$reader = New-Object IO.StreamReader $SecondPath

[void]$reader.ReadLine()  # skip header line
$ref = while ($reader.Peek() -ge 0) {  # Peek() returns -1 at end of stream
    $reader.ReadLine().Split(';')[5]
}

$reader.Close(); $reader.Dispose()
```

Replace 5 with the index of the column whose value you want to extract.
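If you don't want to count columns by hand, the index can be derived from the header line (a sketch, assuming a `;`-separated header that contains the column name `ITEM_TARGET_URI`):

```powershell
# Read only the header line and locate the column by name
$header = (Get-Content $SecondPath -TotalCount 1).Split(';')
$index  = [array]::IndexOf($header, 'ITEM_TARGET_URI')  # -1 if not found
```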


Addendum: You should be able to speed up processing considerably by using hashtable lookups instead of -notcontains.

```powershell
$ref = @{}
Import-Csv $SecondPath -Delimiter ';' | ForEach-Object {
    $ref[$_.ITEM_TARGET_URI] = $true
}

Import-Csv $FirstPath -Delimiter ';' |
    Where-Object { -not $ref.ContainsKey($_.Url) } |
    Select-Object -Expand Url |
    Out-File -Encoding UTF8 .\result.txt
```
Ansgar Wiechers
  • Yes, I've checked that and it is Compare-Object that takes too long. I'll try your solution – Wolwgang Aug 24 '17 at 07:06
  • It's not incredibly faster, but at least I can see progress, 'cause your solution writes to result file after each line, not all at once like my version. – Wolwgang Aug 24 '17 at 10:02
  • 1
    @Wolwgang Try hashtable lookups. See updated answer. – Ansgar Wiechers Aug 24 '17 at 11:07
  • Works like a charm, fast as hell. Now I'll just have to read about those hashable lookups why is it so fast :D – Wolwgang Aug 24 '17 at 11:52
  • AFAIK hashtable lookups are amortized O(1) b/c they don't do linear reads through the list of elements like `-notcontains` does. – Ansgar Wiechers Aug 24 '17 at 12:04
```powershell
$FirstFile | Where {$SecondFile -NotContains $_}
```
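For large inputs, this `-NotContains` filter scales poorly, since it scans `$SecondFile` once per item of `$FirstFile`. A .NET `HashSet` gives near-constant lookups instead; a minimal sketch, assuming `$FirstFile` and `$SecondFile` are the string arrays from the question:

```powershell
# Build the set once, then filter with O(1) average-time lookups
$lookup = [System.Collections.Generic.HashSet[string]]::new([string[]]$SecondFile)
$FirstFile | Where-Object { -not $lookup.Contains($_) }
```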
iRon