
I have a large CSV file in which some fields contain embedded newlines. Excel 2016 produces errors when importing a CSV whose rows contain fields with embedded newlines.

Based on this post, I wrote code to replace every newline in every field with a space. The code block below reproduces both the functionality and the issue. Option 1 works. Option 2, which is commented out, casts my object to a string. I was hoping Option 2 might run faster.

Question: Is there a better way to do this to optimize for performance processing very large files?

$array = @([PSCustomObject]@{"ID"="1"; "Name"="Joe`nSmith"},
           [PSCustomObject]@{"ID"="2"; "Name"="Jasmine Baker"})

$array = $array | ForEach-Object {
    #Option 1: produces an object, but is the code optimized?
    foreach ($n in $_.PSObject.Properties.Name) {
        $_.PSObject.Properties[$n].Value =
             $_.PSObject.Properties[$n].Value -replace "`n", " "
    }

    #Option 2: produces a string, not an object
    #$_ = $_ -replace "`n"," " 

    $_
 }

Keep in mind that in my real-world use case, each row has > 15 fields and any combination of them may have one or more new lines embedded.
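For reference, the two approaches can be compared with `Measure-Command`. Below is a minimal timing sketch, assuming the same `ID`/`Name` shape as the sample above (absolute numbers will vary by machine):

```powershell
# Build a larger sample so the timing difference is visible
$data = 1..5000 | ForEach-Object {
    [PSCustomObject]@{ ID = "$_"; Name = "Joe`nSmith" }
}

# Option 1 as written: pipeline with ForEach-Object
$t1 = Measure-Command {
    $data | ForEach-Object {
        foreach ($n in $_.PSObject.Properties.Name) {
            $_.PSObject.Properties[$n].Value =
                $_.PSObject.Properties[$n].Value -replace "`n", " "
        }
        $_
    } | Out-Null
}

# Same work in a foreach statement with literal .Replace()
$t2 = Measure-Command {
    foreach ($row in $data) {
        foreach ($p in $row.PSObject.Properties) {
            $p.Value = $p.Value.Replace("`n", " ")
        }
    }
}

"{0:N0} ms vs {1:N0} ms" -f $t1.TotalMilliseconds, $t2.TotalMilliseconds
```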

D England
  • I read somewhere that `foreach` is faster than `ForEach-Object`. You could do some testing of both scenarios you mention with the stopwatch class. Utilize it like so: `$resultstime = New-TimeSpan -Minutes 1; $resultstime = [Diagnostics.Stopwatch]::StartNew(); $resultstime.Elapsed`. Remember to stop it :) – Martin Sandgaard Rasmussen May 29 '17 at 14:39
  • Thanks Martin. I'll use the stopwatch class and time my original code compared with the suggestions from @wOxxOm. – D England May 29 '17 at 21:57
  • Please do share your findings :) – Martin Sandgaard Rasmussen May 30 '17 at 07:39
  • Using my data set which is about 16.5 MB and 25,000 rows, the ForEach-Object loop ran in 36 seconds. The foreach loop ran in 11. – D England May 30 '17 at 13:12

1 Answer

  • Use the fast TextFieldParser to read, process, and build the CSV from the file (PowerShell 3+):

    # Load the VB assembly that provides TextFieldParser
    # (Add-Type replaces the deprecated LoadWithPartialName call)
    Add-Type -AssemblyName Microsoft.VisualBasic
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser 'r:\1.csv'
    $parser.SetDelimiters(',')
    $header = $parser.ReadFields()   # first row supplies the column names
    
    $CSV = while (!$parser.EndOfData) {
        $i = 0
        $row = [ordered]@{}
        # quoted multi-line fields come back as single strings with embedded newlines
        foreach ($field in $parser.ReadFields()) {
            $row[$header[$i++]] = $field.Replace("`n", ' ')
        }
        [PSCustomObject]$row
    }
    $parser.Close()
    
  • Or modify each field in-place in an already existing CSV array:

    foreach ($row in $CSV) {
        # iterate over every property (column) of the row object
        foreach ($field in $row.PSObject.Properties) {
            $field.Value = $field.Value.Replace("`n", ' ')
        }
    }
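
Either way, once the newlines are stripped the cleaned array can be written back out; a minimal sketch (the output path here is a placeholder, not from the original answer):

```powershell
# Re-export the cleaned rows; -NoTypeInformation drops the '#TYPE' header line
$CSV | Export-Csv 'r:\1-clean.csv' -NoTypeInformation
```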
    

Notes:

  1. The `foreach` statement is much faster than piping to `ForEach-Object` (which is also aliased as `foreach` inside a pipeline)
  2. `$stringVariable.Replace()` is faster than the `-replace` operator
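
One caveat on note 2: `-replace` is regex-based, while `.Replace()` is a literal string substitution, so they only behave identically when the pattern contains no regex metacharacters. A quick illustration:

```powershell
# Newline has no special regex meaning, so both calls agree here
"Joe`nSmith" -replace "`n", ' '      # regex replace  -> "Joe Smith"
"Joe`nSmith".Replace("`n", ' ')      # literal replace -> "Joe Smith"

# But '.' is a regex wildcard, so the two diverge
'a.b' -replace '.', 'x'              # 'xxx' -- every character matched
'a.b'.Replace('.', 'x')              # 'axb' -- only the literal dot
```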
wOxxOm
  • Thanks for the detailed reply. I'll try both of these approaches and see which is faster. – D England May 29 '17 at 21:57
  • `foreach` is definitely faster than `ForEach-Object`. The replace loop takes 36 seconds with `ForEach-Object` and only 11 seconds with a `foreach` loop. – D England May 29 '17 at 22:26
  • I also give up on trying to use -replace on an object. It seems to only turn the result into a string. 11 seconds is plenty fast for a twice weekly script. – D England May 29 '17 at 22:28
  • Just to link questions, I've written a [similar answer](https://stackoverflow.com/a/67029635/7571258) for a different question, where I wrapped `TextFieldParser` in a reusable function. – zett42 Apr 10 '21 at 10:25