
I'm trying to fix a CSV file that has a trailing ,\r\n in it. No matter what I do, it simply doesn't do anything. I tried putting the expression in [] which makes it replace every single comma. That implies that the issue is that it can't match the newline character.

I have saved the file with Windows line endings using Sublime Text, and have tried \r\n, \n\r, and just \n.

(Get-Content file.txt) | ForEach-Object { $_ -replace '\,\r\n', [System.Environment]::NewLine } | Set-Content file2.txt

I'm using PowerShell version 5.1.15063.413

Jacobm001
    `Get-Content file.txt | ForEach-Object { $_.TrimEnd(',') } | Set-Content file2.txt` ? or in short form `gc file.txt | % TrimEnd ',' | sc file2.txt` – TessellatingHeckler Jul 11 '17 at 21:42

2 Answers


PowerShell turns out to be quite... special.

Get-Content by default returns an array of strings. It finds all newline characters and uses them to split the input into that array. As a result, no newline characters remain in the strings for the regex to match.
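A minimal sketch of that behavior, assuming a hypothetical sample file in the temp directory (the path and content are made up for illustration):

```powershell
# Hypothetical sample file with CRLF line endings and trailing commas.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-demo.csv'
[System.IO.File]::WriteAllText($path, "a,b,`r`nc,d,`r`n")

# Default: one string per line, with the newlines already stripped.
$lines = Get-Content $path
$lines.Count                 # 2
$lines[0] -match '\r|\n'     # False: no CR or LF left to match

# -Raw: a single string that still contains the CRLF sequences.
$raw = Get-Content $path -Raw
$raw -match ',\r\n'          # True
```

This is why the per-line -replace in the question never finds anything: each string it sees ends with the comma, not with a newline.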

A slight variation of this command using the -Raw parameter fixed my issue.

(Get-Content file.txt -Raw).replace(",`r`n", [System.Environment]::NewLine) | Set-Content file2.txt
Jacobm001
    For people still stuck on PowerShell v2, the `-Raw` parameter is not available. Instead, they can read in the array and re-join it with ``(Get-Content file.txt) -join "`n"`` – TheMadTechnician Jul 12 '17 at 01:00

Indeed, Get-Content by default reads and emits the input file's content line by line, with newlines of any flavor - CRLF, LF, CR - stripped.

While this behavior may be unfamiliar, it is generally helpful for processing files in the pipeline.

As your answer shows, -Raw can be used to read a file in full, as a single, multi-line string instead - which can offer great performance benefits.

Line-by-line reading won't help if your file has LF (\n) endings and you're selectively looking for rogue CRLF (\r\n) line endings preceded by a comma. Otherwise, it can be convenient, especially combined with the regex-based -replace operator's ability to operate on each element of an input array:

# Convenient, but can be made faster with -ReadCount 0 - see below.
@(Get-Content file.txt) -replace ',$' | Set-Content file2.txt

Note: @(...), the array-subexpression operator, is used to ensure that the Get-Content call also outputs an array even if the file happens to have just one line.

Regex anchor $ matches the end of each input string (line), in effect removing a trailing , from each line, where present.
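To illustrate the per-element behavior without touching a file, here is a small sketch that uses an in-memory array as a stand-in for Get-Content output (the sample strings are assumptions for illustration):

```powershell
# Illustrative in-memory lines standing in for Get-Content output.
$lines = 'a,b,', 'c,d', 'e,,'

# -replace acts on each array element; ',$' matches a single comma
# at the very end of a line, and omitting the replacement operand deletes it.
$fixed = $lines -replace ',$'
$fixed    # a,b   c,d   e,
```

Note that only one trailing comma is removed per line, whereas the TrimEnd(',') approach from the comment on the question would strip all trailing commas.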


Get-Content performance notes:

As hinted at above, -Raw is by far the fastest way to read a text file in full - but by design as a single, multiline string.

The default behavior, line-by-line reading, is slow, not least because PowerShell decorates each output line with metadata[1] (in the case of -Raw, given that there's only one output string, that happens only once).

However, you can speed things up by reading lines in batches - arrays of lines of a given size - using the -ReadCount parameter, in which case only each array, not each individual line, is decorated. -ReadCount 0 reads all lines into a single array.

Note:

  • -ReadCount changes the streaming behavior in the pipeline: Each array is then sent as a whole through the pipeline, which the receiving command needs to plan for, typically by performing its own enumeration of the array received, such as with a foreach loop.

  • By contrast, using -ReadCount 0 in the context of an expression results in no behavioral difference, which means that it can be used as a simple performance optimization that requires no other parts of the expression to accommodate it; using an expression with a -replace operation as an example:

    # Read all lines directly into an array, with -ReadCount 0,
    # instead of more slowly letting PowerShell stream the lines 
    # (emit them one by one) and then collect them in an array for you.
    # The -replace operator then acts on each element of the array.
    (Get-Content -ReadCount 0 file.txt) -replace ',$'
    

Note: @(...) is not necessary in this case, because -ReadCount 0 always emits an array, even for single-line files.
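For the pipeline case described in the first bullet, the following is a rough sketch of a receiving command that enumerates each batch itself. The file paths, sample content, and batch size of 2 are assumptions chosen for illustration:

```powershell
# Hypothetical input and output files in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
$out  = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.out.txt'
Set-Content $path -Value 'a,', 'b,', 'c,', 'd,', 'e,'

# With -ReadCount 2, each pipeline object is a batch (array) of up to 2 lines.
Get-Content $path -ReadCount 2 | ForEach-Object {
    # $_ is a batch here, so enumerate its lines explicitly.
    foreach ($line in $_) { $line -replace ',$' }
} | Set-Content $out

$result = Get-Content $out
```

Larger batch sizes reduce the per-object overhead further, at the cost of holding more lines in memory at once.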

A better-performing line-by-line-processing alternative - although it cannot directly be used as part of an expression - is to use the switch statement with the -File parameter - see this answer for details.
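A minimal sketch of the switch -File approach, again with hypothetical file paths and sample content; it is one possible shape, not necessarily the one used in the linked answer:

```powershell
# Hypothetical input and output files in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'switch-demo.csv'
$out  = Join-Path ([System.IO.Path]::GetTempPath()) 'switch-demo.out.csv'
Set-Content $path -Value 'a,b,', 'c,d', 'e,f,'

# switch -File reads the file line by line; $_ is the current line.
# Assigning the statement to a variable collects everything the branches output.
$fixed = switch -File $path {
    default { $_ -replace ',$' }
}

$fixed | Set-Content $out
```

Because switch is a statement rather than a command, its output is captured via assignment here instead of being used inline in a larger expression.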


[1] This metadata is provided in the form of ETS (Extended Type System) properties, which notably provide information about the line number and the path of the originating file. Pipe a Get-Content call to | Format-List -Force to see these properties. While this extra information can be helpful, the performance impact of attaching it is noticeable. Given that the information is often not needed, having at least an opt-out would be helpful: see GitHub issue #7537.

mklement0