Environment is Windows Server 2016. I have a folder with 290k subfolders that need to be moved. I also have a CSV file with two fields: one is the (numeric) folder name, the other is the destination. The problem: the CSV contains additional rows, not only the 290k I need, but about 5.3 million rows.

Structure of the CSV is basically:

"ORDER","CLIENT"
11223344,1111
22334455,1111
33445566,2222
44556677,1390

To clarify, every folder name is an ORDER. The order folders shall be sorted into CLIENT folders. Orders are unique. Not every order has a folder.

My first approach was to load the CSV, then foreach through it. It was utterly slow, since for every entry I had to check whether it was actually an existing folder.

$rootfolder = 'path-to-folders'
$csvpath = 'path-to-csv'

$csv = Import-Csv -Path $csvpath

foreach($row in $csv)
{ 
    $path = $rootfolder + $row.ORDER
    $target = $row.CLIENT
    if (Test-Path -Path $path) {
        # copy files to $rootfolder + $target
    }
}

My second approach was to get a list of the subfolders first, walk through that, and look up the corresponding value in the CSV with Where-Object. That turned out even slower.

$rootfolder = 'path-to-folders'
$csvpath = 'path-to-csv'

$folders = Get-ChildItem -Path $rootfolder
$csv = Import-Csv -Path $csvpath

foreach ($folder in $folders) {
    $currentpath = $rootfolder + $folder.Name
    if (Test-Path -Path $currentpath -PathType Container) {
        $target = $csv | Where-Object ORDER -eq $folder.Name
        # copy files to $rootfolder + $target
    }
}

My next idea would be to load that CSV into a temporary database and do the lookup from there, since that should (?) be way faster.

But maybe there are more elaborate ways than my caveman approach? Thanks for reading.
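As a rough, untested sketch of that idea: instead of a real database, a built-in in-memory option would be a System.Data.DataTable with ORDER as primary key, so lookups use an index rather than a scan (the '11223344' probe below is just the sample value from above):

```powershell
# Untested sketch: load the CSV into a DataTable with ORDER as primary key;
# Rows.Find() then does an indexed lookup instead of a linear scan.
$table = New-Object System.Data.DataTable
[void]$table.Columns.Add('ORDER')
[void]$table.Columns.Add('CLIENT')
$table.PrimaryKey = @($table.Columns['ORDER'])
foreach ($row in $csv) {
    [void]$table.Rows.Add($row.ORDER, $row.CLIENT)
}

$hit = $table.Rows.Find('11223344')  # indexed lookup by ORDER
if ($null -ne $hit) { $hit.CLIENT }
```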

mitspieler
  • Can you expand on what you mean by "the (numeric) folder name"? Are the values in the CSV file _not_ the exact folder names? And are the folder names all _unique_? – Mathias R. Jessen Mar 28 '23 at 17:25
  • Does the .csv file contain all possible (numeric) directory names? And, do multiple (numeric) directory names have the same target destination directory in the .csv file? Putting a small sample of directory names and lines from the .csv file (as text) into the question might help others understand better the situation. – lit Mar 28 '23 at 17:30
  • The CSV has a structure like 5234156,1280. Those are order numbers, second are client ids. Order numbers are unique. Users want those 290k folders split up in subfolders by client id. The first value is exactly the name of the subfolder. However not every order has a folder, that's why I have so much excess data in the CSV. I added that to the question. – mitspieler Mar 28 '23 at 17:31
  • 2
    The real slowdown will probably come from checking 290K folders, though looking up 5.3M keys will also not be very fast in PS. For getting existing directories and plonking them in a hash table, `[System.IO.Directory]::EnumerateDirectories(...)` or `GetDirectories` (depends a little on where the overhead lies) should be the fastest way (definitely not `Get-ChildItem -Directory`). Reading the CSV should probably not be done with `ConvertFrom-Csv` or even `Get-Content` but a more plain `foreach ($line in [System.IO.File]::ReadLines())`. – Jeroen Mostert Mar 28 '23 at 17:50
  • The real slowdown is dealing with the CSV afaict. Reading the 290k folders into a variable with Get-ChildItem is reasonably fast. But getting the CSV with Import-CSV alone takes over 10 minutes. I will try the suggestion to parse the CSV manually and update with the results. – mitspieler Mar 28 '23 at 18:52
  • 1
    Can you please add to your question the current code that seem to work but is slow ? – Santiago Squarzon Mar 28 '23 at 20:32
  • The solution you were able to craft with help from @JeroenMostert should be in its own self-answer instead of in your question – Santiago Squarzon Mar 29 '23 at 13:52
  • Oh, ok, will do that. Thanks! – mitspieler Mar 29 '23 at 13:53

2 Answers


Following the suggestions in the comments, I tried this:

$reader = [System.IO.File]::OpenText($csvpath)
while ($null -ne ($line = $reader.ReadLine())) {
    $fields = $line.Split(',')
    $currentpath = $rootfolder + $fields[0]
    if (Test-Path -Path $currentpath -PathType Container) {
        # Copy $currentpath to $fields[1]
    }
}
$reader.Close()

This is much faster; it processes roughly 300-400 folders per second.

Start: 03/29/2023 10:33:31
End: 03/29/2023 10:48:44

That's totally fine for my use case. Thanks to Jeroen Mostert for pointing out that reading the whole CSV into memory before processing it was the biggest culprit, and unnecessary.

Another note: the first two "solutions" use an immense amount of resources; the script was hogging around 5.5 GB of RAM. The improved ReadLine() version runs below 30 MB.
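For completeness, Jeroen Mostert's comment also suggested turning the lookup around: enumerate the ~290k existing folders once into a hash set, then stream the 5.3M CSV lines and test membership instead of calling Test-Path per row. Untested sketch:

```powershell
# Untested sketch: collect existing folder names once, then stream the CSV
# and replace the per-row Test-Path with an O(1) set membership test.
$existing = [System.Collections.Generic.HashSet[string]]::new()
foreach ($dir in [System.IO.Directory]::EnumerateDirectories($rootfolder)) {
    [void]$existing.Add([System.IO.Path]::GetFileName($dir))
}
foreach ($line in [System.IO.File]::ReadLines($csvpath)) {
    $fields = $line.Split(',')
    if ($existing.Contains($fields[0])) {
        # Copy folder $fields[0] to client folder $fields[1]
    }
}
```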

mitspieler

These two statements appear to contradict each other (or I do not understand):

  • To clarify, every folder name is a ORDER
  • $target = $csv | Where-Object CLIENT -eq $folder

Anyway (taking "To clarify, every folder name is a ORDER" as correct), using a lookup table will likely make a big difference in performance. For this, you might use the Join-Object script/Join-Object module or build your own lookup command; see: In Powershell, what's the best way to join two tables into one?:

Assuming you have the following (sub)folders:

Get-ChildItem $rootfolder

    Directory: C:\rootfolder

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-r--           3/29/2023  2:59 PM                11223344
d-r--           3/29/2023  3:00 PM                12345678
d-r--           3/29/2023  2:59 PM                22334455
d-r--           3/29/2023  2:59 PM                33445566
d-r--           3/29/2023  3:00 PM                44556677

And the following Csv file:

# $csv = Import-csv -path $csvpath
$csv = ConvertFrom-Csv @'
"ORDER","CLIENT"
11223344,1111
22334455,1111
33445566,2222
44556677,1390
'@

You probably want to do a left join (which will reveal the missing CLIENTs):

Get-ChildItem -Directory $rootfolder |
    LeftJoin $Csv -on Name -eq Order -Property FullName, Order, Client

FullName               Order    Client
--------               -----    ------
C:\rootfolder\11223344 11223344 1111
C:\rootfolder\12345678 12345678
C:\rootfolder\22334455 22334455 1111
C:\rootfolder\33445566 33445566 2222
C:\rootfolder\44556677 44556677 1390

This means that from here you can do things like:

Get-ChildItem -Directory $rootfolder |LeftJoin $Csv -on Name -eq Order |Foreach-Object {
    if ($_.Client) {
        Write-Host "Copy $($_.FullName) to $($_.Client)"
    }
    else {
        Write-Warning "No client found for $($_.Name)"
    }
}

Copy C:\rootfolder\11223344 to 1111
WARNING: No client found for 12345678
Copy C:\rootfolder\22334455 to 1111
Copy C:\rootfolder\33445566 to 2222
Copy C:\rootfolder\44556677 to 1390
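If you'd rather not depend on the Join-Object module, the same left-join behaviour can be approximated with a plain hashtable built from the Csv (a sketch, using the `$csv` sample defined above):

```powershell
# Sketch: index the Csv by ORDER once, so each folder lookup is O(1)
# instead of a Where-Object scan over all rows.
$byOrder = @{}
foreach ($row in $csv) { $byOrder[$row.ORDER] = $row.CLIENT }

Get-ChildItem -Directory $rootfolder | ForEach-Object {
    $client = $byOrder[$_.Name]
    if ($client) { Write-Host "Copy $($_.FullName) to $client" }
    else         { Write-Warning "No client found for $($_.Name)" }
}
```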
iRon
  • You are right, the statement was wrong, I fixed it. Sorry for the confusion. Will try your solution, thanks! – mitspieler Mar 29 '23 at 13:52