
I have several very large CSV (technically TSV) files that I need to append together. I had used:

copy file1.txt + file2.txt + ... + fileN.txt combined.txt

but then discovered that each file has a BOM at the start (ï»¿), which then appears multiple times in the middle of the combined file.

However, the files are very big (30-40 million lines each), so I can't open them in Notepad++ and re-save them to remove the BOMs. I therefore need a command-line solution (either cmd or PowerShell), ideally something that doesn't require downloading extra libraries.

To recap:

  • Files are too large to open in e.g. Notepad++, so the solution needs to work from the command line
  • This is on Windows, not *nix

(In my case N=4, so I could also cope with a solution that removes the BOM from an individual file, which I could run on each file before combining.)

Edit: This may be a possible solution: Batch script remove BOM (ï»¿) from file, but my knowledge of encodings and PowerShell/batch is so poor that I can't even tell whether it's applicable or not! I don't mind whether the combined file is ANSI or UTF-8 (I'm going to load it into a program that can cope with either), as long as it is internally consistent and correct.

Tim
  • What's the encoding of the files you have? – Santiago Squarzon Apr 07 '22 at 22:27
  • I think it's UTF-8 - viewing the results of my `copy`-based attempt in a large file viewer utility that I have, the BOMs appear as `ï»¿`, which is apparently what a UTF-8 BOM looks like when opened with ANSI encoding - I'm guessing this is what my large file viewer uses, and there doesn't seem to be an option to change it. Or maybe ANSI is what `copy` uses... or maybe both? 🤷‍♂️ (my understanding of these things is sadly limited) – Tim Apr 07 '22 at 22:40
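
(For reference, a quick way to check whether a given file starts with a UTF-8 BOM is to inspect its first three bytes from PowerShell; 239 187 191, i.e. 0xEF 0xBB 0xBF, is the UTF-8 BOM. This is an illustrative sketch, with file1.txt as a placeholder name:)

# Windows PowerShell:
Get-Content -Encoding Byte -TotalCount 3 file1.txt
# PowerShell (Core) 7+:
Get-Content -AsByteStream -TotalCount 3 file1.txt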

2 Answers


I gave up in the end and did it in Python (stick to what you know, eh?):

import shutil

# Stream-copy each input file into the combined output file.
# Opening the inputs with "utf-8-sig" transparently strips a leading BOM, if present;
# the output is written as plain UTF-8 (no BOM).
with open("combined.txt", "w", encoding="utf-8") as wfd:
    for f in my_file_iterator():
        with open(f, "r", encoding="utf-8-sig") as fd:
            shutil.copyfileobj(fd, wfd)

Replace my_file_iterator() with your chosen method or expression for looping over your files, e.g. something based on pathlib.Path.glob().


Tim
  • Your own Python answer is likely the simplest and best-performing solution.

  • If you happen to have WSL installed, you can try the following to merge all file*.txt into combined.txt while stripping the UTF-8 BOM from each (syntax is for calling from PowerShell):

    bash.exe -c 'for f in file*.txt; do tail -c +4 \"\$f\"; done > combined.txt'
    
    • tail -c +4 strips the first 3 bytes from each file and passes the remaining bytes through. Note that the entire output from a for loop can be captured by applying a redirection > to the for statement as a whole.

    • Note: Neither escaping " nor $ with \ should be necessary here, but is as of this writing:

      • Calling bash.exe with -c unexpectedly subjects the command string to up-front string interpolation, necessitating escaping; note that the following, seemingly equivalent call via wsl.exe -e does not exhibit this problem (which is why $f isn't escaped as \$f here):

         wsl.exe -e bash -c 'for f in file*.txt; do tail -c +4 \"$f\"; done > combined.txt'
        
      • Independently, up to at least PowerShell 7.2.2, PowerShell's argument passing to external programs is fundamentally broken with respect to empty arguments and arguments with embedded " chars., necessitating manual \-escaping - see this answer.


As for native PowerShell solutions:

  • The size of your files likely necessitates a memory-efficient streaming solution.

  • However, given the object-based nature of the PowerShell pipeline, with no raw byte support, this is likely to be prohibitively slow, especially with byte-by-byte processing, where each byte must be converted to and from a .NET [byte] object in memory.

For the record, here's the byte-by-byte PowerShell solution, though it is likely too slow for files of this size:

# !! Unfortunately, the syntax for requesting byte-by-byte processing
# !! has changed between Windows PowerShell and PowerShell (Core) 7+,
# !! hence the conditional splatting hashtable below.
$byteStreamParam = 
  if ($IsCoreClr) { @{ AsByteStream = $true } } 
  else            { @{ Encoding = 'Byte' } }

Get-ChildItem -Filter file*.txt |
  ForEach-Object {
    $_ | Get-Content @byteStreamParam | Select-Object -Skip 3
  } |
    Set-Content @byteStreamParam -LiteralPath combined.txt 

However, you can significantly improve the performance by using Get-Content's -ReadCount parameter to read the files in chunks (arrays of bytes). The larger the chunk size - memory permitting - the more runtime performance will improve:

$byteStreamParam = 
  if ($IsCoreClr) { @{ AsByteStream = $true } } 
  else            { @{ Encoding = 'Byte' } }

# How many bytes to read at a time.
$chunkSize = 256mb

Get-ChildItem -Filter file*.txt |
  ForEach-Object {
    $first = $true
    $_ | Get-Content @byteStreamParam -ReadCount $chunkSize | ForEach-Object {
      if ($first) { $_[3..($_.Count-1)]; $first = $false } 
      else        { $_ }
    }
  } |
    Set-Content @byteStreamParam -LiteralPath combined.txt 

Text-based PowerShell solutions:

Text-based solutions, while slower, have the advantage of enabling transcoding, i.e. transforming files from one character encoding to another, using the -Encoding parameter of Get-Content and Set-Content.

In the simplest case, if the individual files fit into memory as a whole (which may not work for you), you can use Get-Content's -Raw switch[1] to read a file's content into memory as a single, multi-line string, which is fast.
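
For instance, here is a rough transcoding sketch (not part of the original answer) based on -Raw, assuming Windows PowerShell, whose -Encoding Default value stands for the system's active ANSI code page; note that characters that cannot be represented in that code page will not survive the conversion:

# Transcoding sketch: read each UTF-8 file in full (BOM detected and stripped
# automatically), then write the combined result using the system's ANSI code page.
Get-ChildItem -Filter file*.txt |
  Get-Content -Raw |
    Set-Content -LiteralPath combined.txt -Encoding Default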

While a line-by-line streaming solution (omitting -Raw) is possible (see the sketch after this list), it comes with caveats:

  • it will be slow, because PowerShell decorates each line read with metadata, which is both time-consuming and memory-intensive.

  • information about the input file's newline format (Windows-format CRLF vs. Unix-format LF) is invariably lost, including whether a trailing newline was present.
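
For completeness, a minimal sketch of that streaming variant (not shown in the original answer); the output encoding defaults to ANSI in Windows PowerShell and to BOM-less UTF-8 in PowerShell (Core) 7+:

# Line-by-line streaming sketch: each file's BOM is consumed during decoding,
# and the lines are re-joined with the platform's newline sequence on output.
Get-ChildItem -Filter file*.txt |
  Get-Content |
    Set-Content -LiteralPath combined.txt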

Note that Get-Content needs no -Encoding argument below, because both PowerShell editions directly recognize UTF-8 files with a BOM.

In Windows PowerShell, unfortunately, the file-writing cmdlets do not support writing UTF-8 files without a BOM: -Encoding utf8 invariably creates files with a BOM, so assistance from .NET APIs is needed:

# Determine the output file, as a *full path*, because
# .NET's working dir. usually differs from PowerShell's.
$outFile = Join-Path ($PWD | Convert-Path) combined.txt

# Create the output file, initially empty.
$null = New-Item -Path $outFile

Get-ChildItem -Filter file*.txt | 
  ForEach-Object {
    # BOM-less UTF-8 is the default.
    [IO.File]::AppendAllText($outFile, ($_ | Get-Content -Raw))
  }

In PowerShell (Core) 7+, not only does -Encoding utf8 now produce BOM-less UTF-8 files (you can request files with a BOM via -Encoding utf8BOM), but BOM-less UTF-8 is also the consistent default, so the solution simplifies to:

Get-ChildItem -Filter file*.txt | 
  Get-Content -Raw |
   Set-Content -LiteralPath combined.txt # BOM-less UTF-8 implied.

[1] Despite its name, this switch does not result in reading raw bytes. Instead, it bypasses the default line-by-line reading in favor of reading the entire file content at once, into a single string.

mklement0
  • This is really helpful, especially the explanation of the "_object_-based nature of the PowerShell pipeline" as the reason an efficient native solution is hard. – Tim Apr 13 '22 at 15:12
  • Glad to hear it, @Tim. – mklement0 Apr 13 '22 at 15:13