0

I need some help here, just to put things in context, I am new to PowerShell, I have a task, that in simple terms, would take a csv with more than 2Million records (coming from BigFix) and a lot of columns, break it into multiple CSVs, by selecting specific columns, so the code below is my attempt to get this done, and the CSVs created will be zipped.Issues, with only 200 thousand records, this took about 4hrs, so first i don't know if there's a way to Import-Csv once rather than importing it each time i have to select different columns for output? Besides the copying task in the beginning (need to be first) and the Zipping need to be after all the CSVs are created, the rest can run at the same time (i don't know how to do that) Thanks for any help

$filePath = "C:\location2\powerShellTesting\Input\bigFixDataNew.csv"

Copy-Item "\\location1\20191213_BFI_SAMPLE_DATA_csv.csv" -Destination $filePath




$System = "..\Output\System.csv"
$AddRemove = "..\Output\AddRemove.csv"
$GS_PC_BIOS = "..\Output\GS_PC_BIOS.csv"
$GS_PROCESSOR = "..\Output\GS_PROCESSOR.csv"
$GS_LOGICAL_DISK = "..\Output\GS_LOGICAL_DISK.csv"
$GS_X86_PC_MEMORY = "..\Output\GS_X86_PC_MEMORY.csv"
$GS_COMPUTER_SYSTEM = "..\Output\GS_COMPUTER_SYSTEM.csv"
$GS_OPERATING_SYSTEM = "..\Output\GS_OPERATING_SYSTEM.csv"
$GS_WORKSTATION_STATUS = "..\Output\GS_WORKSTATION_STATUS.csv"



$desiredColumnsAddRemove = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Component Name'}; label ='DISPLAYNAME'},
@{ expression = {$_.'Product Version'}; label = 'VERSION'},
@{ expression = {$_.'Publisher Name'}; label = 'PUBLISHER'},
@{ expression = {$_.'Creation'}; label = 'INSTALLDATE'}

$desiredColumnsGS_COMPUTER_SYSTEM = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Server Model'}; label = 'MODEL0'},
@{ expression = {$_.'Partition Virtual Processors'}; label = 'NUMBEROFPROCESSORS0'}

$desiredColumnsGS_OPERATING_SYSTEM = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Operating System'}; label = 'NAME0'},
@{ expression = {$_.'Operating System'}; label = 'CAPTION0'}

$desiredColumnsGS_WORKSTATION_STATUS = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID'},
@{ expression = {$_.'Last Scan Attempt'}; label = 'LASTHWSCAN'}

$desiredColumnsSystem = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'DNS Name'}; label = 'NAME'},
@{ expression = {$_.'User Name'}; label = 'USER_NAME'}

$desiredColumnsGS_X86_PC_MEMORY = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }

$desiredColumnsGS_PROCESSOR = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Processor Brand String'}; label = 'NAME0'}

$desiredColumnsGS_PC_BIOS = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Server Serial Number'}; label = 'SERIALNUMBER0'}

$desiredColumnsGS_LOGICAL_DISK = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }




Import-Csv $filePath | Select $desiredColumnsGS_X86_PC_MEMORY -Unique |
Export-Csv -Path $GS_X86_PC_MEMORY –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_PROCESSOR -Unique |
Export-Csv -Path $GS_PROCESSOR –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_PC_BIOS -Unique |
Export-Csv -Path $GS_PC_BIOS –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_LOGICAL_DISK -Unique |
Export-Csv -Path $GS_LOGICAL_DISK –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_OPERATING_SYSTEM -Unique |
Export-Csv -Path $GS_OPERATING_SYSTEM –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_WORKSTATION_STATUS -Unique |
Export-Csv -Path $GS_WORKSTATION_STATUS –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsSystem -Unique |
Export-Csv -Path $System –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsGS_COMPUTER_SYSTEM -Unique |
Export-Csv -Path $GS_COMPUTER_SYSTEM –NoTypeInformation

Import-Csv $filePath | Select $desiredColumnsAddRemove |
Export-Csv -Path $AddRemove –NoTypeInformation



# Creating the Zip File
$compress = @{
    Path = "..\Output\AddRemove.csv",
    "..\Output\GS_COMPUTER_SYSTEM.csv" ,
    "..\Output\GS_OPERATING_SYSTEM.csv",
    "..\Output\GS_WORKSTATION_STATUS.csv",
    "..\Output\System.csv",
    "..\Output\GS_X86_PC_MEMORY.csv",
    "..\Output\GS_PROCESSOR.csv",
    "..\Output\GS_PC_BIOS.csv",
    "..\Output\GS_LOGICAL_DISK.csv"

    CompressionLevel = "Fastest"
    DestinationPath = "..\Output\BigFix.Zip"
}
Compress-Archive @compress -Force
Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68

2 Answers2

0
$filePath = "C:\location2\powerShellTesting\Input\bigFixDataNew.csv"

Copy-Item "\\location1\20191213_BFI_SAMPLE_DATA_csv.csv" -Destination $filePath




$System = "..\Output\System.csv"
$AddRemove = "..\Output\AddRemove.csv"
$GS_PC_BIOS = "..\Output\GS_PC_BIOS.csv"
$GS_PROCESSOR = "..\Output\GS_PROCESSOR.csv"
$GS_LOGICAL_DISK = "..\Output\GS_LOGICAL_DISK.csv"
$GS_X86_PC_MEMORY = "..\Output\GS_X86_PC_MEMORY.csv"
$GS_COMPUTER_SYSTEM = "..\Output\GS_COMPUTER_SYSTEM.csv"
$GS_OPERATING_SYSTEM = "..\Output\GS_OPERATING_SYSTEM.csv"
$GS_WORKSTATION_STATUS = "..\Output\GS_WORKSTATION_STATUS.csv"



$desiredColumnsAddRemove = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Component Name'}; label ='DISPLAYNAME'},
@{ expression = {$_.'Product Version'}; label = 'VERSION'},
@{ expression = {$_.'Publisher Name'}; label = 'PUBLISHER'},
@{ expression = {$_.'Creation'}; label = 'INSTALLDATE'}

$desiredColumnsGS_COMPUTER_SYSTEM = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Server Model'}; label = 'MODEL0'},
@{ expression = {$_.'Partition Virtual Processors'}; label = 'NUMBEROFPROCESSORS0'}

$desiredColumnsGS_OPERATING_SYSTEM = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Operating System'}; label = 'NAME0'},
@{ expression = {$_.'Operating System'}; label = 'CAPTION0'}

$desiredColumnsGS_WORKSTATION_STATUS = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID'},
@{ expression = {$_.'Last Scan Attempt'}; label = 'LASTHWSCAN'}

$desiredColumnsSystem = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'DNS Name'}; label = 'NAME'},
@{ expression = {$_.'User Name'}; label = 'USER_NAME'}

$desiredColumnsGS_X86_PC_MEMORY = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }

$desiredColumnsGS_PROCESSOR = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Processor Brand String'}; label = 'NAME0'}

$desiredColumnsGS_PC_BIOS = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
@{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
@{ expression = {$_.'Server Serial Number'}; label = 'SERIALNUMBER0'}

$desiredColumnsGS_LOGICAL_DISK = @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }


$csvfile = Import-Csv $filePath


$csvfile | Select $desiredColumnsGS_X86_PC_MEMORY -Unique |
Export-Csv -Path $GS_X86_PC_MEMORY –NoTypeInformation

$csvfile | Select $desiredColumnsGS_PROCESSOR -Unique |
Export-Csv -Path $GS_PROCESSOR –NoTypeInformation

$csvfile | Select $desiredColumnsGS_PC_BIOS -Unique |
Export-Csv -Path $GS_PC_BIOS –NoTypeInformation

$csvfile | Select $desiredColumnsGS_LOGICAL_DISK -Unique |
Export-Csv -Path $GS_LOGICAL_DISK –NoTypeInformation

$csvfile | Select $desiredColumnsGS_OPERATING_SYSTEM -Unique |
Export-Csv -Path $GS_OPERATING_SYSTEM –NoTypeInformation

$csvfile | Select $desiredColumnsGS_WORKSTATION_STATUS -Unique |
Export-Csv -Path $GS_WORKSTATION_STATUS –NoTypeInformation

$csvfile | Select $desiredColumnsSystem -Unique |
Export-Csv -Path $System –NoTypeInformation

$csvfile | Select $desiredColumnsGS_COMPUTER_SYSTEM -Unique |
Export-Csv -Path $GS_COMPUTER_SYSTEM –NoTypeInformation

$csvfile | Select $desiredColumnsAddRemove |
Export-Csv -Path $AddRemove –NoTypeInformation



# Creating the Zip File
$compress = @{
    Path = "..\Output\AddRemove.csv",
    "..\Output\GS_COMPUTER_SYSTEM.csv" ,
    "..\Output\GS_OPERATING_SYSTEM.csv",
    "..\Output\GS_WORKSTATION_STATUS.csv",
    "..\Output\System.csv",
    "..\Output\GS_X86_PC_MEMORY.csv",
    "..\Output\GS_PROCESSOR.csv",
    "..\Output\GS_PC_BIOS.csv",
    "..\Output\GS_LOGICAL_DISK.csv"

    CompressionLevel = "Fastest"
    DestinationPath = "..\Output\BigFix.Zip"
}
Compress-Archive @compress -Force

Instead of importing it so many times, import it once into a variable and then just manipulate the variable.

SureThing
  • 97
  • 7
0

The problem, of course, is that you are reading and parsing the file at $filePath once per output file whereas, ideally, it would be read and parsed once period. The temptation might be to just store the result of Import-Csv $filePath in a variable for reuse, but, as you found, that puts you at odds with the maximum size of a .NET array. Even if that weren't the case, you'd still be eating up a lot of memory while your script runs.

Instead of writing one output file at a time, we can read through $filePath just once by interspersing the data writes to each output file on a record-by-record basis. First, let's clean up the code defining which columns get output to which files...

$filePath = "C:\location2\powerShellTesting\Input\bigFixDataNew.csv"

Copy-Item "\\location1\20191213_BFI_SAMPLE_DATA_csv.csv" -Destination $filePath

$outputFileDescriptors = @(
    @{
        Path = "..\Output\System.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'DNS Name'}; label = 'NAME'},
            @{ expression = {$_.'User Name'}; label = 'USER_NAME'}
        )
    },
    @{
        Path = "..\Output\AddRemove.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'Component Name'}; label ='DISPLAYNAME'},
            @{ expression = {$_.'Product Version'}; label = 'VERSION'},
            @{ expression = {$_.'Publisher Name'}; label = 'PUBLISHER'},
            @{ expression = {$_.'Creation'}; label = 'INSTALLDATE'}
        )
    },
    @{
        Path = "..\Output\GS_PC_BIOS.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
            @{ expression = {$_.'Server Serial Number'}; label = 'SERIALNUMBER0'}
        )
    },
    @{
        Path = "..\Output\GS_PROCESSOR.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'Vendor'}; label = 'MANUFACTURER0'},
            @{ expression = {$_.'Processor Brand String'}; label = 'NAME0'}
        )
    },
    @{
        Path = "..\Output\GS_LOGICAL_DISK.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }
        )
    },
    @{
        Path = "..\Output\GS_X86_PC_MEMORY.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' }
        )
    },
    @{
        Path = "..\Output\GS_COMPUTER_SYSTEM.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'Server Vendor'}; label = 'MANUFACTURER0'},
            @{ expression = {$_.'Server Model'}; label = 'MODEL0'},
            @{ expression = {$_.'Partition Virtual Processors'}; label = 'NUMBEROFPROCESSORS0'}
        )
    },
    @{
        Path = "..\Output\GS_OPERATING_SYSTEM.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID' },
            @{ expression = {$_.'Operating System'}; label = 'NAME0'},
            @{ expression = {$_.'Operating System'}; label = 'CAPTION0'}
        )
    },
    @{
        Path = "..\Output\GS_WORKSTATION_STATUS.csv"
        Columns = @(
            @{ expression = {$_.'Internal Computer ID'}; label = 'RESOURCEID'},
            @{ expression = {$_.'Last Scan Attempt'}; label = 'LASTHWSCAN'}         
        )
    } `
        | ForEach-Object -Process { [PSCustomObject] $_ }
)

$outputFileDescriptors will contain an array of [PSCustomObject] instances, each of which with Path and Columns properties defining that output file. At this point you could just rewrite the end of the script as simply...

foreach ($outputFileDescriptor in $outputFileDescriptors)
{
    Import-Csv $filePath | Select $outputFileDescriptor.Columns -Unique |
        Export-Csv -Path $outputFileDescriptor.Path -NoTypeInformation
}

# Creating the Zip File
Compress-Archive -Path ($outputFileDescriptors).Path -DestinationPath "..\Output\BigFix.Zip" `
    -CompressionLevel "Fastest" -Force

...but that contains no performance improvements compared to your original script; we're still calling Import-Csv once for each output file.

Instead, let's modify that loop like this...

foreach ($record in Import-Csv $filePath)
{
    foreach ($outputFileDescriptor in $outputFileDescriptors)
    {
        $record | Select $outputFileDescriptor.Columns |
            Export-Csv -Path $outputFileDescriptor.Path -NoTypeInformation -Append
    }
}

Now we're only calling Import-Csv once, and for each input record we are outputting the appropriate columns to each file. Most importantly, we only ever have a variable reference to one record at a time, reducing memory usage.

There are two other noteworthy changes here. First, we are passing -Append to Export-Csv; this is so the complete file doesn't get overwritten for each record. Second, we are not passing -Unique to Select-Object. We could, but it wouldn't do anything since in this case Select is only considering a single record and not the entire set of input data when evaluating uniqueness.

Unfortunately, Select ... -Unique can't be used for streaming output scenarios like this because it waits until it has evaluated all of the input data before it passes anything down the pipeline (it seems like it certainly could output a value the first time it encounters it, but evidently it doesn't). If you really do have redundant output data that is required to be filtered out, then you could keep track yourself of what data you've already seen...but collecting the data in memory pretty much puts us back where we started unless the amount of unique data is a small percentage of the entire data set and/or the need to remove redundant data is only really an issue for specific output files.

Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68
  • Dear @BACON I want to thank you for the extensive and detail response, The time you took to go through this Is just amazing, I am not really used to stackoverflow, thus my confusion between comments and answer, I will go through this and let you know how this works, I had in the meantime, broken my scripts in smaller pieces (not helping much on the performance) but I still have to go through yours since it is actually providing me knowledge for the future, again thanks – Roi Samagoue Dec 26 '19 at 16:41