35

Hello, I'm looking for a PowerShell script that would merge all CSV files in a directory into one text file (.txt). All CSV files have the same header, which is always stored in the first row of every file. So I need to take the header from the first file, but in the rest of the files the first row should be skipped. I was able to find a batch file which does exactly what I need, but I have more than 4000 CSV files in a single directory and it takes more than 45 minutes to do the job.

@echo off
ECHO Set working directory
cd /d %~dp0
ECHO Deleting existing combined file
del summary.txt
setlocal ENABLEDELAYEDEXPANSION
set cnt=1
for %%i in (*.csv) do (
  if !cnt!==1 (
    for /f "delims=" %%j in ('type "%%i"') do echo %%j >> summary.txt
  ) else (
    for /f "skip=1 delims=" %%j in ('type "%%i"') do echo %%j >> summary.txt
  )
  set /a cnt+=1
)

Any suggestions on how to create a PowerShell script that would be more efficient than this batch code?

Thank you.

John

john50
  • 417
  • 1
  • 5
  • 10

15 Answers

74

If you're after a one-liner, you can pipe each CSV to Import-Csv and then immediately pipe that to Export-Csv. This will retain the initial header row and exclude the remaining files' header rows. It will also process each CSV one at a time rather than loading them all into memory and then dumping them into your merged CSV.

Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
stinkyfriend
  • 926
  • 7
  • 8
  • Is there a way for this to work with PowerShell version 2? It's the only version I have, and it doesn't include the -Append option in Export-Csv – rw2 Jul 10 '18 at 13:01
  • This is definitely the simplest solution - so long as all the source CSV files have the same set of columns in the same order. If the source files have different columns (or orders) and you want a superset file, you'll need to pipe the Import-Csv output into a System.Data.DataTable, adding columns as you go, and pipe the final DataTable out to Export-Csv (a rough sketch of this follows after these comments). – AlwaysLearning Oct 31 '18 at 23:02
  • 1
    This is the "true" PowerShell answer; other answers do not take advantage of key PowerShell functionalities – Zacharious Feb 07 '20 at 16:49
  • 1
    Is there a way to improve the performance of this? e.g. Multi-threading? Just tried to combine a hundred CSVs totalling 2.6 GB taking >30 mins and CPU/Disk usage never touched ~10% of max capacity, so it's neither cpu nor disk bound which means it's just doing everything in a single thread. – Aditya Anand Nov 27 '20 at 06:47
  • @AdityaAnand - I think multi-threading would introduce more problems - all the threads would be trying to append to `merged\merged.csv`. Maybe run the above in batches? i.e. try merge 10's rather than 100's of csvs. I've tried ~300 files totalling ~100mb merging into a ~500MB file and it does it in about 10 seconds. Also, make sure you are not trying to merge your merge file too, that wouldn't be desirable. – stinkyfriend Nov 29 '20 at 23:22
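Regarding the DataTable approach mentioned in the comments above, a minimal sketch (an illustration only, not code from this thread; the merged output path is assumed) might look like this:

$table = New-Object System.Data.DataTable
Get-ChildItem -Filter *.csv | ForEach-Object {
    foreach ($row in (Import-Csv -LiteralPath $_.FullName)) {
        # Add any columns this file introduces that the table doesn't have yet.
        foreach ($prop in $row.PSObject.Properties) {
            if (-not $table.Columns.Contains($prop.Name)) {
                [void]$table.Columns.Add($prop.Name)
            }
        }
        $newRow = $table.NewRow()
        foreach ($prop in $row.PSObject.Properties) { $newRow[$prop.Name] = $prop.Value }
        $table.Rows.Add($newRow)
    }
}
# Select only the data columns so DataRow bookkeeping properties aren't exported.
$table | Select-Object -Property $table.Columns.ColumnName |
    Export-Csv .\merged\merged.csv -NoTypeInformation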
57

This will append all the files together reading them one at a time:

get-childItem "YOUR_DIRECTORY\*.txt" 
| foreach {[System.IO.File]::AppendAllText
 ("YOUR_DESTINATION_FILE", [System.IO.File]::ReadAllText($_.FullName))}

# Placed on seperate lines for readability

This one will place a new line at the end of each file entry if you need it:

get-childItem "YOUR_DIRECTORY\*.txt" | foreach
{[System.IO.File]::AppendAllText("YOUR_DESTINATION_FILE", 
[System.IO.File]::ReadAllText($_.FullName) + [System.Environment]::NewLine)}

Skipping the first line:

$getFirstLine = $true

get-childItem "YOUR_DIRECTORY\*.txt" | foreach {
    $filePath = $_

    $lines = Get-Content $filePath
    $linesToWrite = switch($getFirstLine) {
           $true  {$lines}
           $false {$lines | Select -Skip 1}

    }

    $getFirstLine = $false
    Add-Content "YOUR_DESTINATION_FILE" $linesToWrite
    }
kemiller2002
  • 113,795
  • 27
  • 197
  • 251
  • This code is almost doing exactly what I need. And it is quite fast, but I need to read the header (first row) only from the first file. In all other files the first row should be skipped. get-childItem . *.csv | foreach {[System.IO.File]::AppendAllText(".\summary.txt", [System.IO.File]::ReadAllText($_.FullName))} – john50 Jan 12 '15 at 01:50
  • This is great. Just what I wanted as well. – Amitabh Ghosh Jul 26 '21 at 15:38
12

Try this, it worked for me

Get-Content *.csv| Add-Content output.csv
DaFois
  • 2,197
  • 8
  • 26
  • 43
Anki
  • 129
  • 1
  • 4
6

This is pretty trivial in PowerShell.

$CSVFolder = 'C:\Path\to\your\files';
$OutputFile = 'C:\Path\to\output\file.txt';

$CSV = Get-ChildItem -Path $CSVFolder -Filter *.csv | ForEach-Object { 
    Import-Csv -Path $_
}

$CSV | Export-Csv -Path $OutputFile -NoTypeInformation -Force;

The only drawback to this approach is that it parses every file. It also loads all of the files into memory, so if we're talking about 4000 files that are 100 MB each, you'll obviously run into problems.

You might get better performance with System.IO.File and System.IO.StreamWriter.
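For instance, a rough sketch of that idea (an illustration, not part of the original answer) could reuse the $CSVFolder and $OutputFile variables above, open the output file once with a StreamWriter, and skip the header line for every file after the first:

$writer = New-Object System.IO.StreamWriter $OutputFile
try {
    $first = $true
    foreach ($file in Get-ChildItem -Path $CSVFolder -Filter *.csv) {
        $lines = [System.IO.File]::ReadAllLines($file.FullName)
        $start = if ($first) { 0 } else { 1 }  # keep the header only once
        for ($i = $start; $i -lt $lines.Length; $i++) { $writer.WriteLine($lines[$i]) }
        $first = $false
    }
}
finally {
    $writer.Dispose()
}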

Bacon Bits
  • 30,782
  • 5
  • 59
  • 66
  • Thank you for your answer. Could you please suggest how to implement System.IO.File and System.IO.StreamWriter into your code, because it takes forever to join 4000 files and skip first row from 3999 files. – john50 Jan 12 '15 at 01:55
  • Arrays are fixed length. If you want to add to a collection, use something like a List. https://theposhwolf.com/howtos/PS-Plus-Equals-Dangers/ – Zacharious Feb 07 '20 at 16:50
  • 2
    @Zachafer Thanks. I'm well aware of the issue, but this is an ancient answer. I have replaced the code with a better pattern. – Bacon Bits Feb 09 '20 at 01:47
  • If your files are on a network share, change `Import-Csv -Path $_` to become `Import-Csv -Path $_.FullName` or the script will think you're working w/ C: That happened to me. – shadow2020 Oct 20 '22 at 21:25
2

Your Batch file is pretty inefficient! Try this one (you'll be surprised :)

@echo off
ECHO Set working directory
cd /d %~dp0
ECHO Deleting existing combined file
del summary.txt
setlocal
for %%i in (*.csv) do set /P "header=" < "%%i" & goto continue
:continue

(
   echo %header%
   for %%i in (*.csv) do (
      for /f "usebackq skip=1 delims=" %%j in ("%%i") do echo %%j
   )
) > summary.txt

How this is an improvement

  1. for /f ... in ('type "%%i"') requires loading and executing cmd.exe in order to run the type command, capture its output in a temporary file and then read data from it, and this is done with each input file. for /f ... in ("%%i") reads data directly from the file.
  2. The >> redirection opens the file, appends data at the end and closes the file, and this is done with each output *line*. The > redirection keeps the file open the whole time.
Sam Axe
  • 33,313
  • 9
  • 55
  • 89
Aacini
  • 65,180
  • 12
  • 72
  • 108
  • Do you think it would be worth it to explain the difference between yours and the OPs? – Matt Jan 12 '15 at 00:40
  • @Matt - Aacini's removes the need for a counter variable and logic-checking, giving the script fewer things to do inside the loop, making it faster. – SomethingDark Jan 12 '15 at 01:23
  • Thank you for your help, but for some reason it doesn't work. Error is: "Deleting is not recognized as internal or external command, operable program or batch file". I guess there should be an ECHO command before "Deleting existing combined file". But it doesn't work even after I fixed it. There are just a couple of characters in the summary file. – john50 Jan 12 '15 at 01:34
  • @Matt: The two most important differences are: **1.** `for /f ... in ('type "%%i"')` requires to load and execute cmd.exe in order to execute the `type` command, capture its output in a temporary file and then read data from it, and this is done _with each input file_. `for /f ... in ("%%i")` directly read data from the file. **2.** The `>>` redirection open the file, append data at end and close the file, and this is done _with each output *line*_. The `>` redirection keeps the file open all the time. – Aacini Jan 12 '15 at 19:10
2

If you need to scan the folder recursively, you can use the approach below:

Get-ChildItem -Recurse -Path .\data\*.csv  | Get-Content | Add-Content output.csv

What this basically does is:

  • Get-ChildItem -Recurse -Path .\data\*.csv finds the requested files recursively
  • Get-Content reads the content of each file
  • Add-Content output.csv appends it to output.csv
Derviş Kayımbaşıoğlu
  • 28,492
  • 4
  • 50
  • 72
1

Here is a version that also uses System.IO.File:

$result = "c:\temp\result.txt"
$csvs = get-childItem "c:\temp\*.csv" 
#read and write CSV header
[System.IO.File]::WriteAllLines($result,[System.IO.File]::ReadAllLines($csvs[0])[0])
#read and append file contents minus header
foreach ($csv in $csvs)  {
    $lines = [System.IO.File]::ReadAllLines($csv)
    [System.IO.File]::AppendAllText($result, ($lines[1..$lines.Length] | Out-String))
}
Jan Chrbolka
  • 4,184
  • 2
  • 29
  • 38
  • Thank you for your answer but the result.txt file is for some reason not in the proper format. When I press F4 everything is put together. Also when I press F3 the last line of one file is merged together with the first line of a new file. – john50 Jan 12 '15 at 03:38
  • Just edited the code to insert a "NewLine" after each csv line. – Jan Chrbolka Jan 12 '15 at 04:19
  • Thank you very much. Now it works fine, but it's more than 2 times slower than Kevin's code. Unless somebody has more than a couple hundred files in a directory it shouldn't matter. Thank you again. – john50 Jan 12 '15 at 04:46
  • I see, and I can see why, I was writing each individual line separately. If you have the time, try this code... (edited again) – Jan Chrbolka Jan 12 '15 at 05:00
  • My gut feeling was that calling .NET directly should be faster than “Get-content”/“Add-Content”, but I guess it is not. After testing both versions with a sample of 500 CSV files, “Get-content”/“Add-Content” wins hands down. This [System.IO.File] version: Elapsed Time: 2.254 seconds Kevin’s (“Get-content”/“Add-Content”) version Elapsed Time: 1.741 seconds – Jan Chrbolka Jan 12 '15 at 05:20
1
Get-ChildItem *.csv|select -First 1|Get-Content|select -First 1|Out-File -FilePath .\input.csv -Force #Get the header from one of the CSV Files, write it to input.csv
Get-ChildItem *.csv|foreach {Get-Content $_|select -Skip 1|Out-File -FilePath .\Input.csv -Append} #Get the content of each file, excluding the first line and append it to input.csv
1

stinkyfriend's helpful answer shows an elegant, PowerShell-idiomatic solution based on Import-Csv and Export-Csv.

Unfortunately,

  • it is quite slow because it involves ultimately unnecessary round-trip conversion to and from objects.

  • also, even though it shouldn't matter to a CSV parser, the specific format of the files can get altered in the process, because Export-Csv double-quotes all column values (invariably so in Windows PowerShell, and by default in PowerShell (Core) 7+, which now offers opt-in control via -UseQuotes and -QuoteFields).
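    For example, in PowerShell (Core) 7+ the quoting can be relaxed (a quick illustration, not from the original answer; the output path is assumed):

    (Get-ChildItem -Filter *.csv).FullName | Import-Csv |
        Export-Csv .\merged.csv -UseQuotes AsNeeded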

When performance matters, a plain-text solution is required, which also avoids any inadvertent format alteration (just like the linked answer it assumes that all input CSV files have the same column structure).

The following PSv5+ solution:

  • reads each input file's content into memory in full, as a single multi-line string, using Get-Content -Raw (which is much faster than the default line-by-line reading),
  • skips the header line for all but the first file with -replace '^.+\r?\n', using the regex-based -replace operator,
  • and saves the results to the target file with Set-Content -NoNewLine.

Character-encoding caveat:

  • PowerShell never preserves the input character encoding of files, so you may have to use the -Encoding parameter to override Set-Content's default encoding (the same applies to Export-Csv and any other file-writing cmdlets; in PowerShell (Core) 7+ all cmdlets now consistently default to BOM-less UTF-8; but not only do Windows PowerShell cmdlets not default to UTF-8, they use varying encodings - see the bottom section of this answer).
# Determine the output file and remove a preexisting one, if any.
$outFile = 'summary.csv'
if (Test-Path $outFile) { Remove-Item -ErrorAction Stop $outFile }

# Process all *.csv files in the current folder and merge their contents,
# skipping the header line for all but the first file.
$first = $true
Get-ChildItem -Filter *.csv | 
  Get-Content -Raw | 
    ForEach-Object {
      $content = 
        if ($first) { # first file: output content as-is
          $_; $first = $false
        } else { # subsequent file: skip the header line.
          $_ -replace '^.+\r?\n'
        }
      # Make sure that each file content ends in a newline
      if (-not $content.EndsWith("`n")) { $content += [Environment]::NewLine }
      $content # Output
    } | 
      Set-Content -NoNewLine $outFile # add -Encoding as needed.
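For example, to force UTF-8 output in Windows PowerShell you could change the last line to Set-Content -NoNewLine -Encoding UTF8 $outFile (which writes a UTF-8 BOM there; PowerShell (Core) 7+ also accepts utf8NoBOM, which is its default).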
mklement0
  • 382,024
  • 64
  • 607
  • 775
1

The modern PowerShell 7 answer:
(Assuming all CSV files are in the same directory and have the same number of fields.)

@(Get-ChildItem -Filter *.csv).fullname | Import-Csv |Export-Csv ./merged.csv -NoTypeInformation

The first part of the pipeline gets all the .csv files and expands their FullName property (path + filename + extension); then Import-Csv reads each one and creates objects, and each object gets merged into a single CSV file with only one header.

francisco.l
  • 306
  • 1
  • 12
0

The following batch script is very fast. It should work well as long as none of your CSV files contain tab characters, and all source CSV files have fewer than 64k lines.

@echo off
set "skip="
>summary.txt (
  for %%F in (*.csv) do if defined skip (
    more +1 "%%F"
  ) else (
    more "%%F"
    set skip=1
  )
)

The reason for the restrictions is that MORE converts tabs into a series of spaces, and redirected MORE hangs at 64k lines.

dbenham
  • 127,446
  • 28
  • 251
  • 390
0

I found the previous solutions quite inefficient for large CSV files in terms of performance, so here is a performant alternative that simply appends the files:

cmd /c copy  ((gci "YOUR_DIRECTORY\*.csv" -Name) -join '+') "YOUR_OUTPUT_FILE.csv" 

Thereafter, you probably want to get rid of the duplicate CSV headers.
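A rough sketch of that cleanup (an illustration, not part of this answer; the output file name is a placeholder): keep the first line of the merged file and drop every repeated header line.

$merged = "YOUR_OUTPUT_FILE.csv"
$lines  = Get-Content $merged
$header = $lines[0]
$body   = $lines | Select-Object -Skip 1 | Where-Object { $_ -ne $header }
Set-Content -Path $merged -Value (@($header) + @($body))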

davidhigh
  • 14,652
  • 2
  • 44
  • 75
0
#Input path
$InputFolder = "W:\My Documents\... input folder"
$FileType    = "*.csv"

#Output path
$OutputFile  = "W:\My Documents\... some folder\merged.csv"

#Read list of files
$AllFilesFullName = @(Get-ChildItem -LiteralPath $InputFolder -Filter $FileType | Select-Object -ExpandProperty FullName)

#Loop and write 
Write-Host "Merging" $AllFilesFullName.Count $FileType "files."
foreach ($FileFullName in $AllFilesFullName) {
    Import-Csv $FileFullName | Export-Csv $OutputFile -NoTypeInformation -Append
    Write-Host "." -NoNewline
}

Write-Host
Write-Host "Merge Complete"
SUNIL KUMAR
  • 117
  • 5
-1
$pathin = 'c:\Folder\With\CSVs'
$pathout = 'c:\exported.txt'
$list = Get-ChildItem -Path $pathin | select FullName
foreach($file in $list){
    Import-Csv -Path $file.FullName | Export-Csv -Path $pathout -Append -NoTypeInformation
}
-5

type *.csv >> folder\combined.csv

Kent
  • 1