What flag can we pass to Get-Content to display control characters such as \r\n or \n?

What I am trying to do is determine whether a file's line endings are in the Unix (\n) or DOS (\r\n) style. I have tried simply running Get-Content, which doesn't show any line endings. I have also tried Vim with set list, which just shows $ no matter what the line ending is.

I would like to do this with PowerShell, because that would be mighty useful.

mklement0
Shaun Luttin
  • `Get-Content $File | Out-String` usually works when you need the whole file as a string, but might change things. I suppose you could then use multiple `-replace` operators like ``-replace "`r", "\r"`` to replace every special character you need. Alternately, iterate through every character, cast it to an `[Int]` and see if it's a control character. Might have problems with non-ASCII characters, so make sure you've got the right encoding. It might be easier in the long run to use [`System.IO.StreamReader`](http://msdn.microsoft.com/en-us/library/system.io.streamreader(v=vs.100).aspx). – Bacon Bits Dec 01 '14 at 17:15
  • @BaconBits this looked more like an awesome answer than a comment ;) – Micky Balladelli Dec 01 '14 at 18:19
  • @MickyBalladelli Maybe, but I haven't actually tested doing what the question has asked, so I don't know if it actually will work. For example, I'm not sure how `Get-Content` might change control characters when it pipes to `Out-String` if it gets the file as a string array and then pipes the array to another cmdlet. The question is also lacking details and contains nothing about what was tried or what errors occurred. It's not even clear what he means by "display." What if he's looking for badly-formed line endings? I'm a bit leery of answering a somewhat poor question. – Bacon Bits Dec 01 '14 at 18:31
  • @BaconBits I added more details about what I have tried and what I am trying to do. – Shaun Luttin Dec 02 '14 at 02:27

3 Answers

One way is to use Get-Content's -Encoding parameter e.g.:

Get-Content foo.txt -Encoding byte | % {"0x{0:X2}" -f $_}

If you have the PowerShell Community Extensions, you can use the Format-Hex command:

Format-Hex foo.txt

Address:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F ASCII
-------- ----------------------------------------------- ----------------
00000000 61 73 66 09 61 73 64 66 61 73 64 66 09 61 73 64 asf.asdfasdf.asd
00000010 66 61 73 0D 0A 61 73 64 66 0D 0A 61 73 09 61 73 fas..asdf..as.as
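
To answer the Unix-vs-DOS question directly instead of reading the dump by eye, the bytes can be counted. The following is only a sketch: Get-LineEndingStyle is a hypothetical helper name, not part of PowerShell or PSCX, and it assumes a single-byte or UTF-8 encoded file (in UTF-16 each CR and LF is followed or preceded by a 0x00 byte, which this does not handle).

```powershell
# Hypothetical helper: classify the line endings found in a byte array.
# Assumes a single-byte or UTF-8 encoding (0x0D = CR, 0x0A = LF).
function Get-LineEndingStyle {
  param([byte[]] $Bytes)

  $crlf = 0; $lf = 0; $cr = 0
  for ($i = 0; $i -lt $Bytes.Length; $i++) {
    if ($Bytes[$i] -eq 0x0D) {
      if ($i + 1 -lt $Bytes.Length -and $Bytes[$i + 1] -eq 0x0A) {
        $crlf++
        $i++          # skip the LF that belongs to this CRLF pair
      } else {
        $cr++         # lone CR (classic Mac style)
      }
    } elseif ($Bytes[$i] -eq 0x0A) {
      $lf++           # lone LF (Unix style)
    }
  }

  if     ($crlf -and -not ($lf -or $cr))   { 'CRLF' }
  elseif ($lf   -and -not ($crlf -or $cr)) { 'LF' }
  elseif ($cr   -and -not ($crlf -or $lf)) { 'CR' }
  elseif ($crlf -or $lf -or $cr)           { 'Mixed' }
  else                                     { 'None' }
}
```

Possible usage: Get-LineEndingStyle ([System.IO.File]::ReadAllBytes((Convert-Path foo.txt))). Reading through System.IO.File sidesteps an edition difference in Get-Content: Windows PowerShell uses -Encoding Byte, whereas PowerShell 6+ uses -AsByteStream instead.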

If you really want to see "\r\n" in the output, then do what BaconBits suggests, but you have to use the -Raw parameter, e.g.:

(Get-Content foo.txt -Raw) -replace '\r','\r' -replace '\n','\n' -replace '\t','\t'

Outputs:

asf\tasdfasdf\tasdfas\r\nasdf\r\nas\tasd\r\nasdfasd\tasf\tasdf\t\r\nasdf
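
If a summary is enough, the same raw string can be counted rather than displayed. This is a sketch under assumptions: Measure-LineEnding is a hypothetical helper name, and like -Raw it needs PSv3+ (for [pscustomobject]).

```powershell
# Hypothetical helper: count each kind of newline sequence in a string.
function Measure-LineEnding {
  param([string] $Text)

  [pscustomobject] @{
    CRLF = [regex]::Matches($Text, '\r\n').Count       # DOS/Windows style
    LF   = [regex]::Matches($Text, '(?<!\r)\n').Count  # LF not preceded by CR (Unix style)
    CR   = [regex]::Matches($Text, '\r(?!\n)').Count   # CR not followed by LF
  }
}
```

Possible usage: Measure-LineEnding (Get-Content foo.txt -Raw). A file with only CRLF counts is DOS-style, one with only LF counts is Unix-style, and nonzero values in more than one column point to mixed endings.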
Keith Hill
  • `(Get-Content foo.txt -Raw) -replace '\r','\r' -replace '\n','\n' -replace '\t','\t'` does it. It's very interesting that we can usefully replace `\r` with `\r`. – Shaun Luttin Dec 02 '14 at 02:30
  • ++; just to provide historical context and an update: `Get-Content -Raw` requires PSv3+; PSv5+ ships with its own [`Format-Hex` cmdlet](https://msdn.microsoft.com/en-us/powershell/reference/6/microsoft.powershell.utility/format-hex). – mklement0 Jul 27 '17 at 17:01
Below is a custom function, Debug-String, which visualizes control characters in strings:

  • using PowerShell's own `-prefixed escape-sequence notation (e.g., `r for CR), where a native PowerShell escape is available,

  • falling back to caret notation (e.g., the ASCII-range control character with code point 0x4 - END OF TRANSMISSION - is represented as ^D).

    • Alternatively, you can use the -CaretNotation switch to represent all ASCII-range control characters in caret notation, which gives you output similar to cat -A on Linux and cat -et on macOS/BSD.
  • all other control characters, namely those outside the ASCII range (which spans code points 0x0 - 0x7F), are represented in the form `u{<hex>}, where <hex> is the hexadecimal representation of the code point with up to 6 digits; e.g., `u{85} is Unicode character U+0085, the NEXT LINE control character. This notation is also supported in expandable strings ("..."), but only in PowerShell Core.
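
As a small illustration of the caret-notation rule mentioned in the list above: the letter is obtained by XOR-ing the code point with 64 (the code point of @), which for code points 0 - 31 is the same as adding 64 and which maps DEL (127) to ?. The helper name ConvertTo-CaretNotation is hypothetical and is not part of Debug-String.

```powershell
# Hypothetical helper illustrating caret notation for one character.
function ConvertTo-CaretNotation {
  param([char] $Char)

  $code = [int] $Char
  if ($code -le 31 -or $code -eq 127) {
    # ASCII control character: XOR with 64 ('@') yields the caret letter,
    # e.g. 0x04 -> 'D', 0x09 -> 'I', 0x7F -> '?'.
    '^' + [char] ($code -bxor 64)
  } else {
    # Anything else is passed through unchanged.
    [string] $Char
  }
}
```

For example, ConvertTo-CaretNotation ([char] 4) yields ^D, matching the END OF TRANSMISSION example above.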

Applied to your use case, you'd use (requires PSv3+, due to use of Get-Content -Raw to ensure the file is read as a whole; without it, information about the line endings would be lost):

Get-Content -Raw $file | Debug-String

Two simple examples:


Using PowerShell's escape-sequence notations. Note that this only looks like a no-op: the `-prefixed sequences inside "..." strings create actual control characters.

PS> "a`ab`t c`0d`r`n" | Debug-String
a`ab`t c`0d`r`n

Using -CaretNotation, with output similar to cat -A on Linux:

PS> "a`ab`t c`0d`r`n" | Debug-String -CaretNotation
a^Gb^I c^@d^M$

Debug-String source code:

Note: The function below is also available as an MIT-licensed Gist with additional functionality, notably showing spaces as · and the option to show non-ASCII characters as escape sequences (-UnicodeEscapes), and the option to print a string as a PowerShell string literal (-AsSourceCode). Only the Gist will be maintained going forward.

Assuming you have looked at the linked code to ensure that it is safe (which I can personally assure you of, but you should always check), you can install it directly as follows:

irm https://gist.github.com/mklement0/7f2f1e13ac9c2afaf0a0906d08b392d1/raw/Debug-String.ps1 | iex

Function Debug-String {
  param(
    [Parameter(ValueFromPipeline, Mandatory)]
    [string] $String
    ,
    [switch] $CaretNotation
  )

  begin {
    # \p{C} matches the Unicode category "Other", which includes control
    # characters both inside and outside the ASCII range; note that tabs (`t)
    # are control characters too, but spaces are not.
    $re = [regex] '\p{C}'
  }

  process {

    $re.Replace($String, {
      param($match)
      $handled = $False
      if (-not $CaretNotation) {
        # Translate control chars. that have native PS escape sequences into them.
        $handled = $True
        switch ([int] [char] $match.Value) { # [int] rather than [Int16], so code points above 0x7FFF cannot overflow
          0  { '`0'; break }
          7  { '`a'; break }
          8  { '`b'; break }
          12 { '`f'; break }
          10 { '`n'; break }
          13 { '`r'; break }
          9  { '`t'; break }
          11 { '`v'; break }
          default { $handled = $false }
        } # switch
      }
      if (-not $handled) {
          switch ([int] [char] $match.Value) {
            10 { '$'; break } # cat -A / cat -e visualizes LFs as '$'
            # If it's a control character in the ASCII range, 
            # use caret notation too (C0 range plus DEL).
            # See https://en.wikipedia.org/wiki/Caret_notation
            { $_ -ge 0 -and $_ -le 31 -or $_ -eq 127 } {
              # Caret notation XORs the control-character code point with
              # the code point of '@' (64): for 0-31 this equals adding 64,
              # and it maps DEL (127) to '?', i.e. ^?.
              '^' + [char] (64 -bxor $_)
              break
            }
            # NON-ASCII control characters; use the - PS Core-only - Unicode
            # escape-sequence notation:
            default { '`u{{{0}}}' -f ([int] [char] $_).ToString('x') }
          }
      } # if (-not $handled)
    })  # .Replace
  } # process

}

For brevity I haven't included the comment-based help above; here it is:

<#
.SYNOPSIS
Outputs a string in diagnostic form.

.DESCRIPTION
Prints a string with normally hidden control characters visualized.

Common control characters are visualized using PowerShell's own escaping 
notation by default, such as
"`t" for a tab, "`n" for a LF, and "`r" for a CR.

Any other control characters in the ASCII range (C0 control characters)
are represented in caret notation (see https://en.wikipedia.org/wiki/Caret_notation).

If you want all ASCII range control characters visualized using caret notation,
except LF visualized as "$", similar to `cat -A` on Linux, for instance, 
use -CaretNotation.

Non-ASCII control characters are visualized by their Unicode code point
in the form `u{<hex>}, where <hex> is the hex. representation of the
code point with up to 6 digits; e.g., `u{85} is U+0085, the NEXT LINE
control char.

.PARAMETER CaretNotation
Causes LF to be visualized as "$" and all other ASCII-range control characters
in caret notation, similar to `cat -A` on Linux.

.EXAMPLE
PS> "a`ab`t c`0d`r`n" | Debug-String
a`ab`t c`0d`r`n

.EXAMPLE
PS> "a`ab`t c`0d`r`n" | Debug-String -CaretNotation
a^Gb^I c^@d^M$
#>
mklement0

Here's one way using a regular expression replacement:

function Printable([string] $s) {
    $Matcher = 
    {  
      param($m) 

      $x = $m.Groups[0].Value
      $c = [int]($x.ToCharArray())[0]
      switch ($c)
      {
          9 { '\t' }
          13 { '\r' }
          10 { '\n' }
          92 { '\\' }
          Default { "\$c" }
      }
    }
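    # Match any character outside the printable ASCII range, or a literal
    # backslash, so that backslashes in the input are escaped as '\\'.
    return ([regex]'[^ -~]|\\').Replace($s, $Matcher)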
}

PS C:\> $a = [char[]](65,66,67, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

PS C:\> $b = $a -join ""

PS C:\> Printable $b
ABC\1\2\3\4\5\6\7\8\t\n\11\12\r
Duncan