4

I am merging a lot of large CSV files while skipping the leading junk and appending the filename to each line:

Get-ChildItem . | Where Name -match "Q[0-4]20[0-1][0-9].csv" | 
Foreach-Object {
    $file = $_.BaseName
    Get-Content $_.FullName | select-object -skip 3 | % {
        "$_,${file}" | Out-File -Append temp.csv -Encoding ASCII
    }
}

In PowerShell this is incredibly slow even on an i7/16 GB machine (~5 MB/minute). Can I make it more efficient, or should I just switch to e.g. Python?

P.Windridge
  • I think your bottleneck here is that you are using `Out-File -Append`, i.e. writing to disk for each row in your CSV. A better approach would be to load all the files into memory, do what you need to do, and output the merged CSV in one go. – G42 Nov 17 '17 at 11:17
  • Ah thanks, would Add-Content do this? I'll give it a go – P.Windridge Nov 17 '17 at 11:23
  • Add-Content made no difference to speed, but dropping the appending of the filename to each line made it much faster, so I think that was the bottleneck (but your comment about doing everything in RAM is surely correct.. unfortunately I may not be able to do this ultimately) – P.Windridge Nov 17 '17 at 12:05
  • If your files are that big you will get better gains by using a StreamReader. `Get-Content` starts to suck with larger files. – Matt Nov 17 '17 at 13:19
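
A rough sketch of the StreamReader idea from that last comment, reusing the file pattern, `temp.csv` output and ASCII encoding from the question (everything else here is just an example):

# Sketch only: stream each matching file line by line instead of using Get-Content
$writer = [System.IO.StreamWriter]::new("$PWD\temp.csv", $false, [System.Text.Encoding]::ASCII)
Get-ChildItem . -File | Where-Object Name -match 'Q[0-4]20[0-1][0-9].csv' | ForEach-Object {
    $file = $_.BaseName
    $reader = [System.IO.StreamReader]::new($_.FullName)
    try {
        1..3 | ForEach-Object { [void]$reader.ReadLine() }   # skip the leading junk lines
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine("$line,$file")
        }
    }
    finally { $reader.Dispose() }
}
$writer.Dispose()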

3 Answers

4

Get-Content / Set-Content are terrible with larger files. Streams are a good alternative when performance is key, so with that in mind let's use .NET to read in each file and a single StreamWriter to write out the results.

$rootPath = "C:\temp"
$outputPath = "C:\test\somewherenotintemp.csv"
# One writer for the whole run; every output line goes through it
$streamWriter = [System.IO.StreamWriter]$outputPath
Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    # Read the whole file in one .NET call instead of line-by-line Get-Content
    [System.IO.File]::ReadAllLines($_.FullName) |
        Select-Object -Skip 3 | ForEach-Object {
            $streamWriter.WriteLine(('{0},"{1}"' -f $_,$file))
        }
}
$streamWriter.Close(); $streamWriter.Dispose()

Create a writing stream, $streamWriter, to output the edited lines from each file. We could read and write the files in larger batches, which would be faster, but since we need to ignore a few lines and make a change to each remaining one, processing line by line is simpler. Avoid writing anything to the console during this time, as it will just slow everything down.
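
If you did want the larger-batch variant, one hypothetical way (same paths as above) is to build each file's rewritten lines in memory and append them with a single [System.IO.File]::AppendAllLines call per file:

$rootPath = "C:\temp"
$outputPath = "C:\test\somewherenotintemp.csv"
Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    # Build the rewritten lines for this file in memory...
    $lines = [System.IO.File]::ReadAllLines($_.FullName) |
        Select-Object -Skip 3 |
        ForEach-Object { '{0},"{1}"' -f $_, $file }
    # ...then append them to the output in one call (note: AppendAllLines writes UTF-8 without a BOM)
    if ($lines) { [System.IO.File]::AppendAllLines($outputPath, [string[]]$lines) }
}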

What `'{0},"{1}"' -f $_,$file` does is quote the last "column" that is added, in case the base name contains spaces.
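
For example, with made-up values:

# The base name ends up as a quoted final column
'{0},"{1}"' -f '1,2,3', 'Q1 2017 report'    # outputs: 1,2,3,"Q1 2017 report"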

Matt
  • Nice; two things worth noting: because .NET usually sees a _different_ working directory, a _full_ path should always be used (which is what you're doing). And you don't need _both_ `.Close()` and `.Dispose()`: either one will do (`.Close()` calls `.Dispose()`, which does all the work). – mklement0 Jun 14 '22 at 14:34
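
A minimal sketch of both points from that comment (the output file name 'merged.csv' is just an example): resolve the output path to a full filesystem path before handing it to .NET, and let a single Dispose() close the writer:

# Resolve a PowerShell-relative location to a full filesystem path for .NET (example file name)
$outputPath = Join-Path (Get-Location).ProviderPath 'merged.csv'
$streamWriter = [System.IO.StreamWriter]$outputPath
try {
    $streamWriter.WriteLine('example line')   # ...write lines as in the answer above...
}
finally {
    $streamWriter.Dispose()   # Dispose() alone is enough; it also closes the underlying stream
}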
2
Measure-Command -Expression {
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | select-object -Skip 3 | ForEach-Object {
            "$_,$($file)" | Out-File -Append C:\temp\t\tempe1.csv -Encoding ASCII -Force
        }
    }
} # TotalSeconds      : 12,0526802 for 11415 lines

If you first put everything into an array in memory, things go a lot faster:

Measure-Command -Expression {
    $arr = @()
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        $arr += Get-Content $_.FullName | select-object -Skip 3 | ForEach-Object {
            "$_,$($file)" 
        }
    }
    $arr | Out-File -Append C:\temp\t\tempe2.csv -Encoding ASCII -Force
} # TotalSeconds      :  0,8197193  for 11415 lines

EDIT: Fixed it so that your filename was added to each row.

Snak3d0c
  • Thanks, I'll check later. Even my test case has 1.5m rows, so if doing more in RAM is necessary I might need some extra batching – P.Windridge Nov 17 '17 at 12:11
  • Don't do `$arr += ...` as that will be a huge performance hog in this setup. You are already using the pipeline. Just remove that logic altogether and send it down to `Out-File`/`Set-Content` right away. – Matt Nov 17 '17 at 13:18
  • [Why should I avoid using the increase assignment operator (`+=`) to create a collection](https://stackoverflow.com/a/60708579/1701026) – iRon Nov 17 '21 at 07:03
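
For reference, a pipeline-only variant along the lines of those comments (the output path below is just an example): nothing is accumulated with +=, and Out-File runs once at the end:

Get-ChildItem C:\temp | Where-Object Name -like "*.csv" | ForEach-Object {
    $file = $_.BaseName
    Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,$($file)"   # emit each rewritten line straight down the pipeline
    }
} | Out-File C:\temp\t\tempe3.csv -Encoding ASCII   # example output path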
1

To keep `-Append` from ruining the performance of your script, you could use a buffer array variable:

# Initialize buffer
$csvBuffer = @()

Get-ChildItem *.csv | ForEach-Object {
    $file = $_.BaseName
    $content = Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,${file}"
    }

    # Populate buffer
    $csvBuffer += $content

    # Write buffer to disk once it contains 5000 lines or more
    $csvBufferCount = $csvBuffer | Measure-Object | Select-Object -ExpandProperty Count
    if( $csvBufferCount -ge 5000 )
    {
        $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
        $csvBuffer = @()
    }
}

# Important: flush the buffer remainder
if( $csvBuffer.Count -gt 0 )
{
    $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
    $csvBuffer = @()
}
Chris