
I have this script working, but with 100k+ rows in File1 and 200k+ in File2, it will take days to complete. I got the .Where({ }) part down to less than a second with both CSV files as DataTables, but with that route I can't get the data out the way I want. This script outputs the data the way I want, but it takes 4 seconds per lookup. What can I do to speed this up?

I thought ContainsKey might help somewhere, but PRACT_ID has a one-to-many relationship, so I'm not sure how to handle that. Thanks.

Invoke-Expression "C:\SHC\MSO\DataTable\functionlibrary.ps1" 
[System.Data.DataTable]$Script:MappingTable = New-Object System.Data.DataTable
$File1 = Import-csv "C:\File1.csv" -Delimiter '|' | Sort-Object PRACT_ID
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object 
$Script:MappingTable = $File1 | Out-DataTable

$Logs = "C:\Testing1.7.csv" 

[System.Object]$UserOutput = @()

foreach ($name in $File1) {

    [string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })

    if ($userMatch) {
        # Process the data

        $UserOutput += New-Object PsObject -property @{
            ID_NUMBER                = $name.ID_NUMBER
            PRACT_ID                 = $name.PRACT_ID
            LAST_NAME                = $name.LAST_NAME
            FIRST_NAME               = $name.FIRST_NAME
            MIDDLE_INITIAL           = $name.MIDDLE_INITIAL
            DEGREE                   = $name.DEGREE
            EMAILADDRESS             = $name.EMAILADDRESS
            PRIMARY_CLINIC_PHONE     = $name.PRIMARY_CLINIC_PHONE
            SPECIALTY_NAME           = $name.SPECIALTY_NAME
            State_License            = $name.State_License
            NPI_Number               = $name.NPI_Number
            'University Affiliation' = $name.'University Affiliation'
            Teaching_Title           = $name.Teaching_Title
            FACILITY                 = $userMatch
        }
    }
}


$UserOutput | Select-Object ID_NUMBER, PRACT_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, DEGREE, EMAILADDRESS, PRIMARY_CLINIC_PHONE, SPECIALTY_NAME, State_License, NPI_Number, 'University Affiliation', Teaching_Title, FACILITY |
Export-Csv $logs -NoTypeInformation 
MarcGel
  • Does the second file have headers? Can you add the input files (or at least a part) in the question? – iRon Apr 03 '20 at 19:43
  • Also have a look at: [Compare IPAddress column from csv1 and csv2 and if matches get the uptime column from csv2](https://stackoverflow.com/a/60978188/1701026) – iRon Apr 03 '20 at 19:49

2 Answers


Load $File2 into a hashtable with the $_.Split('|')[0] value as the key - you can then also skip the object creation completely and offload everything to Select-Object:

$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object 

# load $file2 into hashtable
$userTable = @{}
foreach($userEntry in $File2){
    $userTable[$userEntry.Split('|')[0]] = $userEntry
}

# prepare the existing property names we want to preserve
$propertiesToSelect = 'ID_NUMBER', 'PRACT_ID', 'LAST_NAME', 'FIRST_NAME', 'MIDDLE_INITIAL', 'DEGREE', 'EMAILADDRESS', 'PRIMARY_CLINIC_PHONE', 'SPECIALTY_NAME', 'State_License', 'NPI_Number', 'University Affiliation', 'Teaching_Title'

# read file, filter on existence in $userTable, add the FACILITY calculated property before export
Import-Csv "C:\File1.csv" -Delimiter '|' |
    Where-Object { $userTable.ContainsKey($_.PRACT_ID) } |
    Select-Object -Property ($propertiesToSelect + @{Name='FACILITY';Expression={$userTable[$_.PRACT_ID]}}) |
    Export-Csv $logs -NoTypeInformation
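
If PRACT_ID really is one-to-many in File2 (as the question and comments note), the hashtable above keeps only the last File2 line per key. Here is a minimal sketch of one way to keep every match, assuming you want all matching File2 lines combined into the single FACILITY column; joining with '; ' is an assumption (the original [string] cast space-joined multiple matches):

# accumulate every File2 line per PRACT_ID instead of overwriting
$userTable = @{}
foreach ($userEntry in $File2) {
    $key = $userEntry.Split('|')[0]
    if (-not $userTable.ContainsKey($key)) {
        $userTable[$key] = [System.Collections.Generic.List[string]]::new()
    }
    $userTable[$key].Add($userEntry)
}

# same pipeline as above, but FACILITY now joins every match for the user
Import-Csv "C:\File1.csv" -Delimiter '|' |
    Where-Object { $userTable.ContainsKey($_.PRACT_ID) } |
    Select-Object -Property ($propertiesToSelect + @{Name='FACILITY';Expression={$userTable[$_.PRACT_ID] -join '; '}}) |
    Export-Csv $logs -NoTypeInformation
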
Mathias R. Jessen
  • Get-Content is horrendously slow on large files. If you're having him read the raw data, he should use [System.Io.File]::ReadAllLines() – thepip3r Apr 03 '20 at 16:27
  • @thepip3r `ReadAllLines` still reads all lines into memory at once. I would go with `ReadLines` instead to read the file in batches. More information at [stackoverflow.com/questions/21969851/what-is-the-difference-between-file-readlines-and-file-readalllines](https://stackoverflow.com/questions/21969851/what-is-the-difference-between-file-readlines-and-file-readalllines) – RoadRunner Apr 03 '20 at 16:32
  • @thepip3r good point, but might as well pipe `Import-Csv` directly in this case – Mathias R. Jessen Apr 03 '20 at 16:35
  • @RoadRunner, the slowness with GC, in this case, isn't the fact that it reads all lines into memory, it's that it's terribly slow on large files when compared to [File]::ReadAllLines(). – thepip3r Apr 03 '20 at 18:50
  • @RoadRunner, I appear to have mistaken with modern PowerShell... see the edit on my answer. – thepip3r Apr 03 '20 at 19:44
  • I'm afraid I'm not good with the C# stuff; I don't fully understand how to call those methods correctly. So, leaning towards Mathias's approach if we can iron out that error. Or maybe, since File2 uses PRACT_ID as a secondary key, ContainsKey might not be possible? – MarcGel Apr 03 '20 at 22:48

There are a multitude of ways to increase the speed of the operations you're doing, and they can be broken down into in-script and out-of-script possibilities:

Out of Script Possibilities:

Since the files are large, how much memory does the machine you're running this on have? And are you maxing it out during this operation?

If you are paging to disk, this will be the single biggest impact on the overall process!

If you are, the two ways to address this are:

  1. Throw more hardware at the problem (easiest to deal with)
  2. Write your code to iterate over each file in small chunks so you don't load it all into RAM at once (harder if you're not familiar with the technique; see the sketch after this list).
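
As a rough illustration of option 2: [System.IO.File]::ReadLines (mentioned in the comments on the other answer) streams a file one line at a time instead of materializing the whole thing. The path and the split on the first pipe-delimited field are assumptions based on the question's File2 layout:

# stream File2 line by line; only the current line is held in memory
$lookup = @{}
$isHeader = $true
foreach ($line in [System.IO.File]::ReadLines('C:\File2.csv')) {
    if ($isHeader) { $isHeader = $false; continue }   # skip the header row
    $lookup[$line.Split('|')[0]] = $line              # key on the first field (PRACT_ID)
}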

In-Script:

  • Don't use @() with +=; it's really slow, especially over large datasets.

Use an ArrayList instead. Here is a quick sample of the perf difference: ArrayList is consistently ~40x faster on 10,000 entries and ~500x faster on 100,000 entries, and the gap keeps growing as the dataset grows; in other words, @() with += gets slower the bigger the dataset gets:

(Measure-Command {
    $arr = [System.Collections.ArrayList]::new()

    1..100000 | % {
        [void]$arr.Add($_)
    }
}).TotalSeconds

(Measure-Command {
    $arr = @()
    1..100000 | % {
        $arr += $_
    }
}).TotalSeconds

0.8258113      # ArrayList
451.5413987    # @() with +=
  • If you need to do multiple key-based lookups on the data, iterating over the data millions of times will be slow. Import the data as CSV, then build a couple of hashtables mapping key -> data and/or key -> data[]; you can then do index lookups instead of iterating through the arrays millions of times, which will be MUCH faster, assuming you have the RAM available for the extra objects. A sketch follows below.
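
A minimal sketch of the key -> data[] idea, assuming File2 has a header row whose key column is named PRACT_ID (the names here are illustrative; adjust them to the real header):

# index File2 once: PRACT_ID -> array of matching rows
$file2Rows = Import-Csv 'C:\File2.csv' -Delimiter '|'                              # assumes File2 has a header row
$byPractId = $file2Rows | Group-Object -Property PRACT_ID -AsHashTable -AsString   # assumes the key column is called PRACT_ID

# each lookup is now a single hash probe instead of a 200k-row scan
$rowsForOneUser = $byPractId['<some PRACT_ID value>']                               # placeholder key for illustration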

EDIT for @RoadRunner:

My experience with Get-Content may be out of date... it used to be horrendously slow on large files, but in newer PowerShell versions that appears to have been fixed:

[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\10MB.txt", ('8' * 10MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\50MB.txt", ('8' * 50MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\100MB.txt", ('8' * 100MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\500MB.txt", ('8' * 500MB))

$10MB  = gi "$($Env:UserProfile)\Desktop\10MB.txt"
$50MB  = gi "$($Env:UserProfile)\Desktop\50MB.txt"
$100MB = gi "$($Env:UserProfile)\Desktop\100MB.txt"
$500MB = gi "$($Env:UserProfile)\Desktop\500MB.txt"

0..10 | % { 
    $n = [pscustomobject] @{
        'GC_10MB'    = (Measure-Command { Get-Content $10MB }).TotalSeconds
        'RAL_10MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($10MB) }).TotalSeconds
        'GC_50MB'    = (Measure-Command { Get-Content $50MB }).TotalSeconds
        'RAL_50MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($50MB) }).TotalSeconds
        'GC_100MB'   = (Measure-Command { Get-Content $100MB }).TotalSeconds
        'RAL_100MB'  = (Measure-Command { [System.IO.File]::ReadAllLines($100MB) }).TotalSeconds
        'GC_500MB'   = (Measure-Command { Get-Content $500MB }).TotalSeconds
        'RAL_500MB'  = (Measure-Command { [System.IO.File]::ReadAllLines($500MB) }).TotalSeconds
        'Delta_10MB'  = $null
        'Delta_50MB'  = $null
        'Delta_100MB' = $null
        'Delta_500MB' = $null
    }

    $n.Delta_10MB  = "{0:P}" -f ($n.GC_10MB / $n.RAL_10MB)
    $n.Delta_50MB  = "{0:P}" -f ($n.GC_50MB / $n.RAL_50MB)
    $n.Delta_100MB = "{0:P}" -f ($n.GC_100MB / $n.RAL_100MB)
    $n.Delta_500MB = "{0:P}" -f ($n.GC_500MB / $n.RAL_500MB)

    $n
}
thepip3r
  • Hi thepip3r, Can you help me understand where to put this into my script? Thx – MarcGel Apr 03 '20 at 16:42
  • Hi Mathias, on yours I got: Select-Object : Cannot convert System.Object[] to one of the following types {System.String, System.Management.Automation.ScriptBlock}. – MarcGel Apr 03 '20 at 16:43
  • For more clarity, File1 always has 1 PRACT_ID per user. File2 can have many PRACT_IDs assigned to a user in File1, each a different facility. Thx – MarcGel Apr 03 '20 at 17:29
  • @MarcGel, how much RAM does the machine you're running this on have? Are you ever getting close to 100% of its use when you run this? – thepip3r Apr 03 '20 at 19:33
  • 16 GB RAM on a VM. Usually have 4 GB available when running this. – MarcGel Apr 03 '20 at 20:38