
I have to parse a variable input string into a string array. The input is a CSV-style, comma-separated field list in which each field is a quoted string. Because I don't want to write my own full-blown CSV parser, the only working solution I have come up with so far is this one:

$InputString = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic"""' # note: $input would collide with PowerShell's automatic variable

Add-Type -AssemblyName Microsoft.VisualBasic
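# Wrap the string in a byte array and MemoryStream, because the TextFieldParser overload used below reads from a stream.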
$enc = [System.Text.Encoding]::UTF8
$bytes = $enc.GetBytes($InputString)
$stream = [System.IO.MemoryStream]::new($bytes)
$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($stream)
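# Split on ',' and honor quoted fields (embedded quotes escaped as "").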
$parser.Delimiters = ','
$parser.HasFieldsEnclosedInQuotes = $true
$list = $parser.ReadFields()

$list

Output looks like this:

Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"

Is there a better solution available via another .NET library for PowerShell? Ideally I would like to avoid the extra byte array and stream. I am also not sure whether this VisualBasic assembly will remain available in the long term.

Any ideas here?

Carsten

2 Answers


With some extra precautions for security and to prevent inadvertent string expansion (interpolation), you can combine Invoke-Expression with Write-Output, though note that Invoke-Expression should generally be avoided:

$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'

# Parse into array.
$fields = (
  Invoke-Expression ("Write-Output -- " + ($fieldList -replace '\$', "`0"))
) -replace "`0", '$$'

Note:

  • -replace '\$', "`0" temporarily replaces literal $ characters in the input with NUL characters to prevent accidental (or malicious) string expansion (interpolation); the second -replace operation restores the original $ characters.
    See this answer for more information about the regex-based -replace operator.

  • Prepending Write-Output -- to the resulting string and interpreting the result as a PowerShell command via Invoke-Expression causes Write-Output to parse the remainder of the string as individual arguments and output them as such. -- ensures that any arguments that happen to look like Write-Output's own parameters are not interpreted as such.

  • If and only if the input string is guaranteed to never contain embedded $ characters, the solution can be simplified to the following (see the sketch just below for what can happen otherwise):

    $fields = Invoke-Expression "Write-Output -- $fieldList" 
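
To illustrate why the NUL-character round-trip matters, here is a quick sketch (not part of the solution above; $unsafeFields is a made-up variable name, and the exact result depends on your environment). With the sample input, the simplified form expands the embedded $HOME reference:

$unsafeFields = Invoke-Expression "Write-Output -- $fieldList"
$unsafeFields[-1]   # -> e.g. "Honey, I'm C:\Users\<you>" rather than "Honey, I'm $HOME"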
    

Outputting $fields yields the following:

Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME

Explanation and list of constraints:

The solution relies on making the input string part of a string whose content is a syntactically valid Write-Output call, with the input string serving as the latter's arguments. Invoke-Expression then evaluates this string as if its content had directly been submitted as a command and therefore executes the Write-Output command. Based on how PowerShell parses command arguments, this implies the following constraints:

  • Supported field separators:

    • Either: ,-separated (with per-field (unquoted) leading and/or trailing whitespace getting removed, as shown above).

    • Or: whitespace-separated, using one or more whitespace characters between the fields (see the sketch after this list).

  • Non-/quoting of embedded fields:

    • Fields can be quoted:

      • If single-quoted ('...'), field-internal ' characters must be escaped as ''.

      • If double-quoted, field-internal " characters must be escaped as either "" or `".

    • Fields can also be unquoted:

      • However, such fields mustn't contain any PowerShell argument-mode metacharacters (of these, < > @ # are only metacharacters at the start of a token):

         <space> ' " ` , ; ( ) { } | & < > @ #        
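
As a quick check of the whitespace-separated, unquoted case, here is a sketch with made-up sample values (not from the original answer); it is safe without the NUL-character round-trip only because the input contains no $ characters:

$spaceSeparated = 'Steve   Frank Earvin'
(Invoke-Expression "Write-Output -- $spaceSeparated").Count   # -> 3, i.e. three separate strings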
        

Alternative, via ConvertFrom-Csv:

iRon's helpful answer shows a solution based on ConvertFrom-Csv, given that the field list embedded in the input string is comma-separated (,):

  • On the one hand, it is more limited in that it only supports "..."-quoting of fields and ""-escaping of field-internal ", and doesn't support fields separated by varying amounts of whitespace (only).

  • On the other hand, it is more flexible, in that it supports any single-character separator between the fields (irrespective of incidental leading/trailing per-field whitespace), which can be specified via the -Delimiter parameter (see the sketch at the end of this answer).

What makes the solution awkward is the need to anticipate the max. number of embedded fields and to provide dummy headers (column names) for them (-Header (0..99)) in order to make ConvertFrom-Csv work, which is both fragile and potentially wasteful.

However, a simple trick can bypass this problem: Submit the input string twice, in which case ConvertFrom-Csv treats the fields in the input string as both the column names and as the column values of the one and only output row (object), whose values can then be queried:

$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'

# Creates the same array as the solution at the top.
$fields = ($fieldList, $fieldList | ConvertFrom-Csv).psobject.Properties.Value
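
If the fields were instead separated by, say, ; (a hypothetical variation, not part of the question), the same twice-submission trick applies, with the separator passed via -Delimiter:

$semiList = '"Miller, Steve";"Zappa, Frank";"Johnson, Earvin ""Magic"""'
($semiList, $semiList | ConvertFrom-Csv -Delimiter ';').psobject.Properties.Value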
mklement0
    Submitting the input string twice is really a smart way to resolve the unknown number of column names – iRon May 07 '21 at 15:00

If the list has a limited (known maximum) number of fields, you might use the parser of the ConvertFrom-Csv cmdlet, like:

$List = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'
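# Dummy headers 0..99 assume at most 100 fields; the .Where filter drops the $null values of the unused columns.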
($List | ConvertFrom-Csv -Header (0..99)).PSObject.Properties.Value.Where{ $Null -ne $_ }
Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME
iRon
  • +1 for the idea, though the `-Header (0..99)` part is a bit awkward. A simple solution is to supply the input string _twice_: `($List, $List | ConvertFrom-Csv).psobject.Properties.Value` – mklement0 May 07 '21 at 14:53