
In a PowerShell script, I read a CSV file.

I have to "fix" some values. Specifically, the CSV may contain an empty value, the literal string NULL, or sometimes -. All of these values should be treated as $null.

Is there a way to intercept CSV parsing to handle that?

I actually have a working solution, but it is awfully slow: iterating over 2500 items takes 20 minutes, while reading the CSV file as-is takes only a few seconds.

The idea is to iterate over each property:

$private:result = @{}
foreach($private:prop in $private:line.PSObject.Properties){
    $private:value = $private:prop.Value
    $private:result.Add($private:prop.Name, ($private:value | Filter-Value))
}
$private:result
...

function Filter-Value{
    param(
        [Parameter(Position=0, ValueFromPipeline=$true)]
        [object]$In
    )

    if(-not $In){
        $null
    }
    elseif(($In -is [string]) -and ($In.Length -eq 0)) {
        $null
    }
    elseif(($In -eq "NULL") -or ($In -eq "-")) {
        $null
    }
    else{
        $In
    }
}

Full code:


function Import-CsvEx{
    param(
        [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
        [ValidateScript({Test-Path $_ -PathType Leaf})]
        [string]$Path,
        [Parameter()]
        [string]$Delimiter
    )
    begin{
        Write-Verbose "Begin read of file $Path"
    }
    process{
        # We use file stream and stream reader to automatically detect encoding
        $private:fileStream = [System.IO.File]::OpenRead($Path)

        $private:streamReader = New-Object System.IO.StreamReader($private:fileStream, [System.Text.Encoding]::Default, $true)

        $private:fileContent = $private:streamReader.ReadToEnd()

        $private:streamReader.Dispose()
        $private:fileStream.Dispose()        

        $private:csv = ConvertFrom-Csv $private:fileContent -Delimiter $Delimiter

        for($private:i=0; $private:i -lt $private:csv.Count ; $private:i++){
            Write-Progress -Id 1003 -Activity "Reading CSV" -PercentComplete ($private:i*100/$private:csv.count)
            $private:line = $private:csv[$private:i]
            $private:result = @{}
            foreach($private:prop in $private:line.PSObject.Properties){
                $private:value = $private:prop.Value
                $private:result.Add($private:prop.Name, ($private:value | Filter-Value))
            }

            # actually outputs the object to the pipeline
            New-Object psobject -Property $private:result

        }
        Write-Progress -Id 1003 -Activity "Reading CSV" -Completed

    }
    end{
        Write-Verbose "End read of file $Path"
    }
}

function Filter-Value{
    param(
        [Parameter(Position=0, ValueFromPipeline=$true)]
        [object]$In
    )

    if(-not $In){
        $null
    }
    elseif(($In -is [string]) -and ($In.Length -eq 0)) {
        $null
    }
    elseif(($In -eq "NULL") -or ($In -eq "-")) {
        $null
    }
    else{
        $In
    }
}

mklement0
Steve B

3 Answers


Given that performance is the concern:

  • Avoid the pipeline (though at the cost of having to fit all data into memory).

  • Avoid use of Write-Progress.

  • Avoid repeated reflection via .psobject.Properties.

As an aside: use of the $private: scope is rarely necessary and makes your code hard to read. Assigning to a variable by mere name inside a function implicitly creates a local variable (e.g., $var = 42); you would only need $private: if you had to explicitly prevent descendant scopes from seeing those variables - see this answer for more information.
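For example (a minimal illustration, not part of the original answer), a plain assignment inside a function already creates a function-local variable without any scope modifier:

```powershell
function Test-Scope {
    $var = 42          # implicitly creates a local $var; no $private: needed
    "inside: $var"
}

$var = 'outer'
Test-Scope             # outputs 'inside: 42'
"outside: $var"        # outputs 'outside: outer' - the caller's $var is untouched
```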

# Import the CSV data into a collection in memory.
# NOTE: In Windows PowerShell, Import-Csv defaults to ASCII(!) encoding.
#       Use -Encoding Default to use the system's ANSI code page, for instance.
#       PowerShell [Core] 6+ consistently defaults to (BOM-less) UTF-8.
$objects = Import-Csv $Path -Delimiter $Delimiter

# Extract the property (column) names from the 1st imported object.
$propNames = $objects[0].psobject.Properties.Name

# Loop over all objects...
foreach ($object in $objects) {

  # ... and make the quasi-null properties $null.
  foreach ($propName in $propNames) {
    if ($object.$propName -in '', '-', 'NULL') {
      $object.$propName = $null
    }
  }

  # Output the modified object right away, if desired.
  # Alternatively, operate on the $objects collection later.
  $object

}

If you cannot fit all data into memory, use Import-Csv ... | ForEach-Object { ... }, while still only extracting the property names in the first invocation of the script block ({ ... }).
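A minimal sketch of that streaming variant, assuming the same $Path and $Delimiter values as above:

```powershell
$propNames = $null

Import-Csv $Path -Delimiter $Delimiter | ForEach-Object {
    # Reflect over the properties only once, on the first row.
    if ($null -eq $propNames) {
        $propNames = $_.psobject.Properties.Name
    }

    # Make the quasi-null property values $null, in place.
    foreach ($propName in $propNames) {
        if ($_.$propName -in '', '-', 'NULL') {
            $_.$propName = $null
        }
    }

    $_  # emit the modified object to the pipeline right away
}
```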

mklement0
  • Regarding the `$private` scope, I have [faced scoping issues](https://stackoverflow.com/questions/58319077/avoid-variable-in-module-to-be-overriden-by-caller-and-vice-versa). The Import-CsvEX function is part of a reusable module, and I need to avoid scoping issues at all cost. – Steve B Mar 11 '20 at 07:51
  • I updated my code. I now have quite fast processing of the CSV file. I kept the Write-Progress, but changed the code to update the initial object instead of emitting a new one. I also moved the reflection out of the for loop. Everything is working well now. Thanks! – Steve B Mar 11 '20 at 08:35
  • Glad to hear it, @SteveB; my pleasure. Regarding the scoping issues: ensuring that all local variables are initialized and thereby explicitly created is sufficient, for which you don't need the `$private:` scope, which makes no difference in your function, because no child scopes are created; see [my answer](https://stackoverflow.com/a/60632546/45375) to the linked post. – mklement0 Mar 11 '20 at 09:27

This is a perfect use-case for a filter - a pipeline-friendly function that only implements the process block:

filter Parse-Null {
  # iterate over all properties, look for "null-like" values, replace with empty string
  foreach($prop in $_.psobject.Properties){
    if($prop.Value -in '-','NULL'){
      $prop.Value = ''
    }
  }

  # pass the (potentially modified) object along
  $_
}

Then use like:

$csvData = @'
h1,h2,h3
NULL,something,-
,-,someother
'@
$csvData |ConvertFrom-Csv |Parse-Null
# or
Import-Csv ... |Parse-Null
Mathias R. Jessen
  • Not sure I get the idea... how would that speed up the process? There's still the iteration over all properties (which is, I think, the hot spot) – Steve B Mar 10 '20 at 17:18
  • @SteveB it is indeed, but you're not going to get around that part - the difference here is that I modify the existing object instead of creating a new one :) – Mathias R. Jessen Mar 10 '20 at 17:19
  • Updating the object instead of emitting a new one is very beneficial to the overall speed. Thanks for that – Steve B Mar 11 '20 at 08:37

I like to Import-Csv, manipulate, and then Export-Csv:

C:\> $ted = import-csv -Path ted.csv
C:\> $ted

Name Desc
---- ----
ted
fred Dash-Dash
ned  NULL


C:\> $ted | ? { $_.Desc -match 'NULL|-|""' -or $_.Desc.Length -eq 0 -or $Null -eq $_.Desc } | % { $_.Desc = "In" }
C:\> $ted

Name Desc
---- ----
ted  In
fred In
ned  In

C:\> $ted | Export-Csv -Path ted.csv -NoTypeInformation
midacts
  • Yes, `Import-Csv` is an improvement over the OP's file stream + `ConvertFrom-Csv` approach, but note that a requirement is to _loop over all properties_ (columns) to check for values to fix, and to do so fast. – mklement0 Mar 10 '20 at 19:56
  • The file stream approach was required to handle an encoding issue. Using this constructor, it matches the system default encoding, whereas Import-Csv from a file uses ASCII encoding. – Steve B Mar 11 '20 at 08:38
  • @SteveB, `Import-Csv` has an `-Encoding` parameter, as all file-processing standard cmdlets do; use `-Encoding Default` to use the system's ANSI code page. Windows PowerShell is, unfortunately, wildly inconsistent with respect to default character encodings, unlike PowerShell [Core] 6+, which now commendably consistently defaults to BOM-less UTF-8 - see the bottom section of [this answer](https://stackoverflow.com/a/40098904/45375). – mklement0 Mar 12 '20 at 18:02